3D NAND flash memory with advanced multi-level cell techniques provides high storage density, but... more 3D NAND flash memory with advanced multi-level cell techniques provides high storage density, but suffers from significant performance degradation due to a large number of read-retry operations. Although the read-retry mechanism is essential to ensuring the reliability of modern NAND flash memory, it can significantly increase the read latency of an SSD by introducing multiple retry steps that read the target page again with adjusted read-reference voltage values. Through a detailed analysis of the read mechanism and rigorous characterization of 160 real 3D NAND flash memory chips, we find new opportunities to reduce the read-retry latency by exploiting two advanced features widely adopted in modern NAND flash-based SSDs: 1) the CACHE READ command and 2) strong ECC engine. First, we can reduce the read-retry latency using the advanced CACHE READ command that allows a NAND flash chip to perform consecutive reads in a pipelined manner. Second, there exists a large ECC-capability margin in the final retry step that can be used for reducing the chip-level read latency. Based on our new findings, we develop two new techniques that effectively reduce the read-retry latency: 1) Pipelined Read-Retry (PR 2) and 2) Adaptive Read-Retry (AR 2). PR 2 reduces the latency of a read-retry operation by pipelining consecutive retry steps using the CACHE READ command. AR 2 shortens the latency of each retry step by dynamically reducing the chip-level read latency depending on the current operating conditions that determine the ECC-capability margin. Our evaluation using twelve real-world workloads shows that our proposal improves SSD response time by up to 31.5% (17% on average) over a state-of-the-art baseline with only small changes to the SSD controller. CCS CONCEPTS • Hardware → External storage.
As data privacy and security rapidly become key requirements, securely erasing data from a storag... more As data privacy and security rapidly become key requirements, securely erasing data from a storage system becomes as important as reliably storing data in the system. Unfortunately, in modern flash-based storage systems, it is challenging to irrecoverably erase (i.e., sanitize) a file without large performance or reliability penalties. In this paper, we propose Evanesco, a new data sanitization technique specifically designed for high-density 3D NAND flash memory. Unlike existing techniques that physically destroy stored data, Evanesco provides data sanitization by blocking access to stored data. By exploiting existing spare flash cells in the flash memory chip, Evanesco efficiently supports two new flash lock commands (pLock and bLock) that disable access to deleted data at both page and block granularities. Since the locked page (or block) can be unlocked only after its data is erased, Evanesco provides a strong security guarantee even against an advanced threat model. To evaluate our technique, we build SecureSSD, an Evanesco-enabled emulated flash storage system. Our experimental results show that SecureSSD can effectively support data sanitization with a small performance overhead and no reliability degradation.
Resistive memories have limited lifetime caused by limited write endurance and highly non-uniform... more Resistive memories have limited lifetime caused by limited write endurance and highly non-uniform write access patterns. Two main techniques to mitigate endurance-related memory failures are 1) wear-leveling, to evenly distribute the writes across the entire memory, and 2) fault tolerance, to correct memory cell failures. However, one of the main open challenges in extending the lifetime of existing resistive memories is to make both techniques work together seamlessly and e ciently. To address this challenge, we propose WoLFRaM, a new mechanism that combines both wear-leveling and fault tolerance techniques at low cost by using a programmable resistive address decoder (PRAD). The key idea of WoLFRaM is to use PRAD for implementing 1) a new e cient wear-leveling mechanism that remaps write accesses to random physical locations on the y, and 2) a new e cient fault tolerance mechanism that recovers from faults by remapping failed memory blocks to available physical locations. Our evaluations show that, for a Phase Change Memory (PCM) based system with cell endurance of 10 8 writes, WoLFRaM increases the memory lifetime by 68% compared to a baseline that implements the best state-of-the-art wear-leveling and fault correction mechanisms. WoLFRaM's average / worst-case performance and energy overheads are 0.51% / 3.8% and 0.47% / 2.1% respectively.
Modern client processors typically use one of three commonlyused power delivery network (PDN) arc... more Modern client processors typically use one of three commonlyused power delivery network (PDN) architectures: 1) motherboard voltage regulators (MBVR), 2) integrated voltage regulators (IVR), and 3) low dropout voltage regulators (LDO). We observe that the energy-e ciency of each of these PDNs varies with the processor power (e.g., thermal design power (TDP) and dynamic power-state) and workload characteristics (e.g., workload type and computational intensity). This leads to energyine ciency and performance loss, as modern client processors operate across a wide spectrum of power consumption and execute a wide variety of workloads.
DRAM manufacturers have been prioritizing memory capacity, yield, and bandwidth for years, while ... more DRAM manufacturers have been prioritizing memory capacity, yield, and bandwidth for years, while trying to keep the design complexity as simple as possible. DRAM chips are passive elements that store data, but they do not carry out any computation or other important function in the system, such as security. Processors implement and execute most of the existing security mechanisms that protect the system against security threats, because 1) executing security mechanisms usually require non-trivial computational capabilities (e.g., encryption), and 2) commodity DRAM chips are not designed to perform computations or tasks other than data storage. In this work, we advocate for DRAM as a key component for providing security mechanisms to the system. To this end, we propose Dataplant, a new class of low-cost, highperformance, and reliable security primitives that can be integrated in commodity DRAM chips with minimal changes. The main idea of Dataplant is to slightly modify the internal DRAM timing signals to expose the inherent process variation found in all DRAM chips for generating unpredictable but reproducible values (e.g., keys) within DRAM, without affecting regular DRAM operation. We use Dataplant to build two new security mechanisms. First, a new Dataplant-based physical unclonable function (PUF) with non-destructive read-out, low evaluation latency, robust responses, resiliency to temperature changes, and data-independent responses. Second, a new cold boot attack prevention mechanism based on Dataplant that automatically destroys all data within DRAM on every power cycle with zero run-time energy and latency overheads. These mechanisms can be integrated with current DDR memory chips without changing the DRAM array. Using a combination of detailed simulations and experiments with 136 real commodity DRAM chips, we show that our Dataplant-based PUF has 1.8x higher throughput than the best state-of-the-art DRAM PUFs while being more resilient to temperature changes, and totally independent of the values stored in memory. We also demonstrate that our Dataplant-based cold boot attack protection mechanism is 19.5x faster and consumes 2.54x less energy when compared to existing mechanisms.
DRAM is the prevalent main memory technology, but its long access latency can limit the performan... more DRAM is the prevalent main memory technology, but its long access latency can limit the performance of many workloads. Although prior works provide DRAM designs that reduce DRAM access latency, their reduced storage capacities hinder the performance of workloads that need large memory capacity. Because the capacity-latency trade-o is xed at design time, previous works cannot achieve maximum performance under very di erent and dynamic workload demands. This paper proposes Capacity-Latency-Recon gurable DRAM (CLR-DRAM), a new DRAM architecture that enables dynamic capacity-latency trade-o at low cost. CLR-DRAM allows dynamic recon guration of any DRAM row to switch between two operating modes: 1) max-capacity mode, where every DRAM cell operates individually to achieve approximately the same storage density as a density-optimized commodity DRAM chip and 2) high-performance mode, where two adjacent DRAM cells in a DRAM row and their sense ampli ers are coupled to operate as a single low-latency logical cell driven by a single logical sense ampli er. We implement CLR-DRAM by adding isolation transistors in each DRAM subarray. Our evaluations show that CLR-DRAM can improve system performance and DRAM energy consumption by 18.6% and 29.7% on average with four-core multiprogrammed workloads. We believe that CLR-DRAM opens new research directions for a system to adapt to the diverse and dynamically changing memory capacity and access latency demands of workloads.
DRAM is the building block of modern main memory systems. DRAM cells must be periodically refresh... more DRAM is the building block of modern main memory systems. DRAM cells must be periodically refreshed to prevent data loss. Refresh operations degrade system performance by interfering with memory accesses. As DRAM chip density increases with technology node scaling, refresh operations also increase because: 1) the number of DRAM rows in a chip increases; and 2) DRAM cells need additional refresh operations to mitigate bit failures caused by RowHammer, a failure mechanism that becomes worse with technology node scaling. Thus, it is critical to enable refresh operations at low performance overhead. To this end, we propose a new operation, Hidden Row Activation (HiRA), and the HiRA Memory Controller (HiRA-MC) to perform HiRA operations. HiRA hides a refresh operation's latency by refreshing a row concurrently with accessing or refreshing another row within the same bank. Unlike prior works, HiRA achieves this parallelism without any modifications to off-the-shelf DRAM chips. To do so, it leverages the new observation that two rows in the same bank can be activated without data loss if the rows are connected to different charge restoration circuitry. We experimentally demonstrate on 56% real off-the-shelf DRAM chips that HiRA can reliably parallelize a DRAM row's refresh operation with refresh or activation of any of the 32% of the rows within the same bank. By doing so, HiRA reduces the overall latency of two refresh operations by 51.4%. HiRA-MC modifies the memory request scheduler to perform HiRA when a refresh operation can be performed concurrently with a memory access or another refresh. Our system-level evaluations show that HiRA-MC increases system performance by 12.6% and 3.73× as it reduces the performance degradation due to periodic refreshes and refreshes for RowHammer protection (preventive refreshes), respectively, for future DRAM chips with increased density and RowHammer vulnerability.
DRAM is the dominant main memory technology used in modern computing systems. Computing systems i... more DRAM is the dominant main memory technology used in modern computing systems. Computing systems implement a memory controller that interfaces with DRAM via DRAM commands. DRAM executes the given commands using internal components (e.g., access transistors, sense amplifiers) that are orchestrated by DRAM internal timings, which are fixed for each DRAM command. Unfortunately, the use of fixed internal timings limits the types of operations that DRAM can perform and hinders the implementation of new functionalities and custom mechanisms that improve DRAM reliability, performance and energy. To overcome these limitations, we propose enabling programmable DRAM internal timings for controlling in-DRAM components. To this end, we design CODIC, a new low-cost DRAM substrate that enables fine-grained control over four previously fixed internal DRAM timings that are key to many DRAM operations. We implement CODIC with only minimal changes to the DRAM chip and the DDRx interface. To demonstrate the potential of CODIC, we propose two new CODIC-based security mechanisms that outperform state-of-the-art mechanisms in several ways: (1) a new DRAM Physical Unclonable Function (PUF) that is more robust and has significantly higher throughput than state-of-the-art DRAM PUFs, and (2) the first cold boot attack prevention mechanism that does not introduce any performance or energy overheads at runtime.
To understand and improve DRAM performance, reliability, security, and energy efficiency, prior w... more To understand and improve DRAM performance, reliability, security, and energy efficiency, prior works study characteristics of commodity DRAM chips. Unfortunately, state-of-the-art open source infrastructures capable of conducting such studies are obsolete, poorly supported, or difficult to use, or their inflexibility limits the types of studies they can conduct. We propose DRAM Bender, a new FPGA-based infrastructure that enables experimental studies on state-of-the-art DRAM chips. DRAM Bender offers three key features at the same time. First, DRAM Bender enables directly interfacing with a DRAM chip through its low-level interface. This allows users to issue DRAM commands in arbitrary order and with finer-grained time intervals compared to other open source infrastructures. Second, DRAM Bender exposes easy-to-use C++ and Python programming interfaces, allowing users to quickly and easily develop different types of DRAM experiments. Third, DRAM Bender is easily extensible. The modular design of DRAM Bender allows extending it to (i) support existing and emerging DRAM interfaces, and (ii) run on new commercial or custom FPGA boards with little effort. To demonstrate that DRAM Bender is a versatile infrastructure, we conduct three case studies, two of which lead to new observations about the DRAM RowHammer vulnerability. In particular, we show that data patterns supported by DRAM Bender uncover a larger set of bit-flips on a victim row than those commonly used by prior work. We demonstrate the extensibility of DRAM Bender by implementing it on five different FPGAs with DDR4 and DDR3 support.
2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO)
DRAM is the building block of modern main memory systems. DRAM cells must be periodically refresh... more DRAM is the building block of modern main memory systems. DRAM cells must be periodically refreshed to prevent data loss. Refresh operations degrade system performance by interfering with memory accesses. As DRAM chip density increases with technology node scaling, refresh operations also increase because: 1) the number of DRAM rows in a chip increases; and 2) DRAM cells need additional refresh operations to mitigate bit failures caused by RowHammer, a failure mechanism that becomes worse with technology node scaling. Thus, it is critical to enable refresh operations at low performance overhead. To this end, we propose a new operation, Hidden Row Activation (HiRA), and the HiRA Memory Controller (HiRA-MC) to perform HiRA operations. HiRA hides a refresh operation's latency by refreshing a row concurrently with accessing or refreshing another row within the same bank. Unlike prior works, HiRA achieves this parallelism without any modifications to off-the-shelf DRAM chips. To do so, it leverages the new observation that two rows in the same bank can be activated without data loss if the rows are connected to different charge restoration circuitry. We experimentally demonstrate on 56% real off-the-shelf DRAM chips that HiRA can reliably parallelize a DRAM row's refresh operation with refresh or activation of any of the 32% of the rows within the same bank. By doing so, HiRA reduces the overall latency of two refresh operations by 51.4%. HiRA-MC modifies the memory request scheduler to perform HiRA when a refresh operation can be performed concurrently with a memory access or another refresh. Our system-level evaluations show that HiRA-MC increases system performance by 12.6% and 3.73× as it reduces the performance degradation due to periodic refreshes and refreshes for RowHammer protection (preventive refreshes), respectively, for future DRAM chips with increased density and RowHammer vulnerability.
2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO)
Data movement between main memory and the processor is a key contributor to the execution time an... more Data movement between main memory and the processor is a key contributor to the execution time and energy consumption of memory-intensive applications. This data movement bottleneck can be alleviated using Processing-in-Memory (PiM). One category of PiM is Processing-using-Memory (PuM), in which computation takes place inside the memory array by exploiting intrinsic analog properties of the memory device. PuM yields high throughput and e ciency, but supports a limited range of operations. As a result, PuM architectures cannot e ciently perform some complex operations (e.g., multiplication, division, exponentiation) without sizeable increases in chip area and design complexity. To overcome this limitation in DRAM-based PuM architectures, we introduce pLUTo (processing-using-memory with lookup table (LUT) operations), a DRAM-based PuM architecture that leverages the high area density of DRAM to enable the massively parallel storing and querying of lookup tables (LUTs). The use of LUTs enables pLUTo to e ciently execute complex operations in-memory via memory reads (i.e., LUT queries) instead of relying on complex extra logic or performing long sequences of DRAM commands. pLUTo outperforms the optimized CPU and GPU baselines in performance/energy eciency by an average of 1960×/307× and 4.2×/4× across the evaluated workloads, and by 33×/8× and 110×/80× for the LeNet-5 quantized neural network. pLUTo outperforms a state-of-the-art PiM baseline by 50×/342× in performance/energy e ciency.
2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA)
Random number generation is an important task in a wide variety of critical applications includin... more Random number generation is an important task in a wide variety of critical applications including cryptographic algorithms, scientific simulations, and industrial testing tools. True Random Number Generators (TRNGs) produce cryptographicallysecure truly random data by sampling a physical entropy source that typically requires custom hardware and suffers from long latency. To enable high-bandwidth and low-latency TRNGs on widely-available commodity devices, recent works propose hardware TRNGs that generate random numbers using commodity DRAM as an entropy source. Although prior works demonstrate promising TRNG mechanisms using DRAM, practical integration of such mechanisms into real systems poses various challenges. We identify three key challenges for using DRAM-based TRNGs in current systems: (1) generating random numbers with DRAM-based TRNGs can degrade overall system performance by slowing down concurrently-running applications due to the interference between RNG and regular memory operations in the memory controller (i.e., RNG interference), (2) this RNG interference can degrade system fairness by causing unfair prioritization of applications that intensively use random numbers (i.e., RNG applications), and (3) RNG applications can experience significant slowdown due to the high latency of DRAM-based TRNGs. To address these challenges, we propose DR-STRaNGe, an end-to-end system design for DRAM-based TRNGs that (1) reduces the RNG interference by separating RNG requests from regular memory requests in the memory controller, (2) improves fairness across applications with an RNG-aware memory request scheduler, and (3) hides the large TRNG latencies using a random number buffering mechanism combined with a new DRAM idleness predictor that accurately identifies idle DRAM periods. We evaluate DR-STRaNGe using a comprehensive set of 186 multi-programmed workloads. Compared to an RNG-oblivious baseline system, DR-STRaNGe improves the performance of non-RNG and RNG applications on average by 17.9% and 25.1%, respectively. DR-STRaNGe improves system fairness by 32.1% on average when generating random numbers at a 5 Gb/s throughput. DR-STRaNGe reduces energy consumption by 21% compared to the RNG-oblivious baseline design by reducing the time spent for RNG and non-RNG memory accesses by 15.8%.
2020 IEEE 38th International Conference on Computer Design (ICCD), 2020
Resistive memories have limited lifetime caused by limited write endurance and highly non-uniform... more Resistive memories have limited lifetime caused by limited write endurance and highly non-uniform write access patterns. Two main techniques to mitigate endurance-related memory failures are 1) wear-leveling, to evenly distribute the writes across the entire memory, and 2) fault tolerance, to correct memory cell failures. However, one of the main open challenges in extending the lifetime of existing resistive memories is to make both techniques work together seamlessly and e ciently. To address this challenge, we propose WoLFRaM, a new mechanism that combines both wear-leveling and fault tolerance techniques at low cost by using a programmable resistive address decoder (PRAD). The key idea of WoLFRaM is to use PRAD for implementing 1) a new e cient wear-leveling mechanism that remaps write accesses to random physical locations on the y, and 2) a new e cient fault tolerance mechanism that recovers from faults by remapping failed memory blocks to available physical locations. Our evaluations show that, for a Phase Change Memory (PCM) based system with cell endurance of 10 8 writes, WoLFRaM increases the memory lifetime by 68% compared to a baseline that implements the best state-of-the-art wear-leveling and fault correction mechanisms. WoLFRaM's average / worst-case performance and energy overheads are 0.51% / 3.8% and 0.47% / 2.1% respectively.
Low-cost devices are now commonplace and can be found in diverse environments such as home electr... more Low-cost devices are now commonplace and can be found in diverse environments such as home electronics or cars. As such devices are ubiquitous and portable, manufacturers strive to minimize their cost and power consumption by eliminating many features that exist in other computing systems. Unfortunately, this results in a lack of basic security features, even though their ubiquity and ease of physical access make these devices particularly vulnerable to attacks. To address the lack of security mechanisms in low-cost devices that make use of DRAM, we propose Dataplant, a new set of low-cost, high-performance, and reliable security primitives that reside in and make use of commodity DRAM chips. The main idea of Dataplant is to slightly modify the internal DRAM timing signals to expose the inherent process variation found in all DRAM chips for generating unpredictable but reproducible values (e.g., keys, seeds, signatures) within DRAM, without affecting regular DRAM operation. We use D...
3D NAND flash memory with advanced multi-level cell techniques provides high storage density, but... more 3D NAND flash memory with advanced multi-level cell techniques provides high storage density, but suffers from significant performance degradation due to a large number of read-retry operations. Although the read-retry mechanism is essential to ensuring the reliability of modern NAND flash memory, it can significantly increase the read latency of an SSD by introducing multiple retry steps that read the target page again with adjusted read-reference voltage values. Through a detailed analysis of the read mechanism and rigorous characterization of 160 real 3D NAND flash memory chips, we find new opportunities to reduce the read-retry latency by exploiting two advanced features widely adopted in modern NAND flash-based SSDs: 1) the CACHE READ command and 2) strong ECC engine. First, we can reduce the read-retry latency using the advanced CACHE READ command that allows a NAND flash chip to perform consecutive reads in a pipelined manner. Second, there exists a large ECC-capability margin in the final retry step that can be used for reducing the chip-level read latency. Based on our new findings, we develop two new techniques that effectively reduce the read-retry latency: 1) Pipelined Read-Retry (PR 2) and 2) Adaptive Read-Retry (AR 2). PR 2 reduces the latency of a read-retry operation by pipelining consecutive retry steps using the CACHE READ command. AR 2 shortens the latency of each retry step by dynamically reducing the chip-level read latency depending on the current operating conditions that determine the ECC-capability margin. Our evaluation using twelve real-world workloads shows that our proposal improves SSD response time by up to 31.5% (17% on average) over a state-of-the-art baseline with only small changes to the SSD controller. CCS CONCEPTS • Hardware → External storage.
As data privacy and security rapidly become key requirements, securely erasing data from a storag... more As data privacy and security rapidly become key requirements, securely erasing data from a storage system becomes as important as reliably storing data in the system. Unfortunately, in modern flash-based storage systems, it is challenging to irrecoverably erase (i.e., sanitize) a file without large performance or reliability penalties. In this paper, we propose Evanesco, a new data sanitization technique specifically designed for high-density 3D NAND flash memory. Unlike existing techniques that physically destroy stored data, Evanesco provides data sanitization by blocking access to stored data. By exploiting existing spare flash cells in the flash memory chip, Evanesco efficiently supports two new flash lock commands (pLock and bLock) that disable access to deleted data at both page and block granularities. Since the locked page (or block) can be unlocked only after its data is erased, Evanesco provides a strong security guarantee even against an advanced threat model. To evaluate our technique, we build SecureSSD, an Evanesco-enabled emulated flash storage system. Our experimental results show that SecureSSD can effectively support data sanitization with a small performance overhead and no reliability degradation.
Resistive memories have limited lifetime caused by limited write endurance and highly non-uniform... more Resistive memories have limited lifetime caused by limited write endurance and highly non-uniform write access patterns. Two main techniques to mitigate endurance-related memory failures are 1) wear-leveling, to evenly distribute the writes across the entire memory, and 2) fault tolerance, to correct memory cell failures. However, one of the main open challenges in extending the lifetime of existing resistive memories is to make both techniques work together seamlessly and e ciently. To address this challenge, we propose WoLFRaM, a new mechanism that combines both wear-leveling and fault tolerance techniques at low cost by using a programmable resistive address decoder (PRAD). The key idea of WoLFRaM is to use PRAD for implementing 1) a new e cient wear-leveling mechanism that remaps write accesses to random physical locations on the y, and 2) a new e cient fault tolerance mechanism that recovers from faults by remapping failed memory blocks to available physical locations. Our evaluations show that, for a Phase Change Memory (PCM) based system with cell endurance of 10 8 writes, WoLFRaM increases the memory lifetime by 68% compared to a baseline that implements the best state-of-the-art wear-leveling and fault correction mechanisms. WoLFRaM's average / worst-case performance and energy overheads are 0.51% / 3.8% and 0.47% / 2.1% respectively.
Modern client processors typically use one of three commonlyused power delivery network (PDN) arc... more Modern client processors typically use one of three commonlyused power delivery network (PDN) architectures: 1) motherboard voltage regulators (MBVR), 2) integrated voltage regulators (IVR), and 3) low dropout voltage regulators (LDO). We observe that the energy-e ciency of each of these PDNs varies with the processor power (e.g., thermal design power (TDP) and dynamic power-state) and workload characteristics (e.g., workload type and computational intensity). This leads to energyine ciency and performance loss, as modern client processors operate across a wide spectrum of power consumption and execute a wide variety of workloads.
DRAM manufacturers have been prioritizing memory capacity, yield, and bandwidth for years, while ... more DRAM manufacturers have been prioritizing memory capacity, yield, and bandwidth for years, while trying to keep the design complexity as simple as possible. DRAM chips are passive elements that store data, but they do not carry out any computation or other important function in the system, such as security. Processors implement and execute most of the existing security mechanisms that protect the system against security threats, because 1) executing security mechanisms usually require non-trivial computational capabilities (e.g., encryption), and 2) commodity DRAM chips are not designed to perform computations or tasks other than data storage. In this work, we advocate for DRAM as a key component for providing security mechanisms to the system. To this end, we propose Dataplant, a new class of low-cost, highperformance, and reliable security primitives that can be integrated in commodity DRAM chips with minimal changes. The main idea of Dataplant is to slightly modify the internal DRAM timing signals to expose the inherent process variation found in all DRAM chips for generating unpredictable but reproducible values (e.g., keys) within DRAM, without affecting regular DRAM operation. We use Dataplant to build two new security mechanisms. First, a new Dataplant-based physical unclonable function (PUF) with non-destructive read-out, low evaluation latency, robust responses, resiliency to temperature changes, and data-independent responses. Second, a new cold boot attack prevention mechanism based on Dataplant that automatically destroys all data within DRAM on every power cycle with zero run-time energy and latency overheads. These mechanisms can be integrated with current DDR memory chips without changing the DRAM array. Using a combination of detailed simulations and experiments with 136 real commodity DRAM chips, we show that our Dataplant-based PUF has 1.8x higher throughput than the best state-of-the-art DRAM PUFs while being more resilient to temperature changes, and totally independent of the values stored in memory. We also demonstrate that our Dataplant-based cold boot attack protection mechanism is 19.5x faster and consumes 2.54x less energy when compared to existing mechanisms.
DRAM is the prevalent main memory technology, but its long access latency can limit the performan... more DRAM is the prevalent main memory technology, but its long access latency can limit the performance of many workloads. Although prior works provide DRAM designs that reduce DRAM access latency, their reduced storage capacities hinder the performance of workloads that need large memory capacity. Because the capacity-latency trade-o is xed at design time, previous works cannot achieve maximum performance under very di erent and dynamic workload demands. This paper proposes Capacity-Latency-Recon gurable DRAM (CLR-DRAM), a new DRAM architecture that enables dynamic capacity-latency trade-o at low cost. CLR-DRAM allows dynamic recon guration of any DRAM row to switch between two operating modes: 1) max-capacity mode, where every DRAM cell operates individually to achieve approximately the same storage density as a density-optimized commodity DRAM chip and 2) high-performance mode, where two adjacent DRAM cells in a DRAM row and their sense ampli ers are coupled to operate as a single low-latency logical cell driven by a single logical sense ampli er. We implement CLR-DRAM by adding isolation transistors in each DRAM subarray. Our evaluations show that CLR-DRAM can improve system performance and DRAM energy consumption by 18.6% and 29.7% on average with four-core multiprogrammed workloads. We believe that CLR-DRAM opens new research directions for a system to adapt to the diverse and dynamically changing memory capacity and access latency demands of workloads.
DRAM is the building block of modern main memory systems. DRAM cells must be periodically refresh... more DRAM is the building block of modern main memory systems. DRAM cells must be periodically refreshed to prevent data loss. Refresh operations degrade system performance by interfering with memory accesses. As DRAM chip density increases with technology node scaling, refresh operations also increase because: 1) the number of DRAM rows in a chip increases; and 2) DRAM cells need additional refresh operations to mitigate bit failures caused by RowHammer, a failure mechanism that becomes worse with technology node scaling. Thus, it is critical to enable refresh operations at low performance overhead. To this end, we propose a new operation, Hidden Row Activation (HiRA), and the HiRA Memory Controller (HiRA-MC) to perform HiRA operations. HiRA hides a refresh operation's latency by refreshing a row concurrently with accessing or refreshing another row within the same bank. Unlike prior works, HiRA achieves this parallelism without any modifications to off-the-shelf DRAM chips. To do so, it leverages the new observation that two rows in the same bank can be activated without data loss if the rows are connected to different charge restoration circuitry. We experimentally demonstrate on 56% real off-the-shelf DRAM chips that HiRA can reliably parallelize a DRAM row's refresh operation with refresh or activation of any of the 32% of the rows within the same bank. By doing so, HiRA reduces the overall latency of two refresh operations by 51.4%. HiRA-MC modifies the memory request scheduler to perform HiRA when a refresh operation can be performed concurrently with a memory access or another refresh. Our system-level evaluations show that HiRA-MC increases system performance by 12.6% and 3.73× as it reduces the performance degradation due to periodic refreshes and refreshes for RowHammer protection (preventive refreshes), respectively, for future DRAM chips with increased density and RowHammer vulnerability.
DRAM is the dominant main memory technology used in modern computing systems. Computing systems i... more DRAM is the dominant main memory technology used in modern computing systems. Computing systems implement a memory controller that interfaces with DRAM via DRAM commands. DRAM executes the given commands using internal components (e.g., access transistors, sense amplifiers) that are orchestrated by DRAM internal timings, which are fixed for each DRAM command. Unfortunately, the use of fixed internal timings limits the types of operations that DRAM can perform and hinders the implementation of new functionalities and custom mechanisms that improve DRAM reliability, performance and energy. To overcome these limitations, we propose enabling programmable DRAM internal timings for controlling in-DRAM components. To this end, we design CODIC, a new low-cost DRAM substrate that enables fine-grained control over four previously fixed internal DRAM timings that are key to many DRAM operations. We implement CODIC with only minimal changes to the DRAM chip and the DDRx interface. To demonstrate the potential of CODIC, we propose two new CODIC-based security mechanisms that outperform state-of-the-art mechanisms in several ways: (1) a new DRAM Physical Unclonable Function (PUF) that is more robust and has significantly higher throughput than state-of-the-art DRAM PUFs, and (2) the first cold boot attack prevention mechanism that does not introduce any performance or energy overheads at runtime.
To understand and improve DRAM performance, reliability, security, and energy efficiency, prior w... more To understand and improve DRAM performance, reliability, security, and energy efficiency, prior works study characteristics of commodity DRAM chips. Unfortunately, state-of-the-art open source infrastructures capable of conducting such studies are obsolete, poorly supported, or difficult to use, or their inflexibility limits the types of studies they can conduct. We propose DRAM Bender, a new FPGA-based infrastructure that enables experimental studies on state-of-the-art DRAM chips. DRAM Bender offers three key features at the same time. First, DRAM Bender enables directly interfacing with a DRAM chip through its low-level interface. This allows users to issue DRAM commands in arbitrary order and with finer-grained time intervals compared to other open source infrastructures. Second, DRAM Bender exposes easy-to-use C++ and Python programming interfaces, allowing users to quickly and easily develop different types of DRAM experiments. Third, DRAM Bender is easily extensible. The modular design of DRAM Bender allows extending it to (i) support existing and emerging DRAM interfaces, and (ii) run on new commercial or custom FPGA boards with little effort. To demonstrate that DRAM Bender is a versatile infrastructure, we conduct three case studies, two of which lead to new observations about the DRAM RowHammer vulnerability. In particular, we show that data patterns supported by DRAM Bender uncover a larger set of bit-flips on a victim row than those commonly used by prior work. We demonstrate the extensibility of DRAM Bender by implementing it on five different FPGAs with DDR4 and DDR3 support.
2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO)
DRAM is the building block of modern main memory systems. DRAM cells must be periodically refresh... more DRAM is the building block of modern main memory systems. DRAM cells must be periodically refreshed to prevent data loss. Refresh operations degrade system performance by interfering with memory accesses. As DRAM chip density increases with technology node scaling, refresh operations also increase because: 1) the number of DRAM rows in a chip increases; and 2) DRAM cells need additional refresh operations to mitigate bit failures caused by RowHammer, a failure mechanism that becomes worse with technology node scaling. Thus, it is critical to enable refresh operations at low performance overhead. To this end, we propose a new operation, Hidden Row Activation (HiRA), and the HiRA Memory Controller (HiRA-MC) to perform HiRA operations. HiRA hides a refresh operation's latency by refreshing a row concurrently with accessing or refreshing another row within the same bank. Unlike prior works, HiRA achieves this parallelism without any modifications to off-the-shelf DRAM chips. To do so, it leverages the new observation that two rows in the same bank can be activated without data loss if the rows are connected to different charge restoration circuitry. We experimentally demonstrate on 56% real off-the-shelf DRAM chips that HiRA can reliably parallelize a DRAM row's refresh operation with refresh or activation of any of the 32% of the rows within the same bank. By doing so, HiRA reduces the overall latency of two refresh operations by 51.4%. HiRA-MC modifies the memory request scheduler to perform HiRA when a refresh operation can be performed concurrently with a memory access or another refresh. Our system-level evaluations show that HiRA-MC increases system performance by 12.6% and 3.73× as it reduces the performance degradation due to periodic refreshes and refreshes for RowHammer protection (preventive refreshes), respectively, for future DRAM chips with increased density and RowHammer vulnerability.
2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO)
Data movement between main memory and the processor is a key contributor to the execution time an... more Data movement between main memory and the processor is a key contributor to the execution time and energy consumption of memory-intensive applications. This data movement bottleneck can be alleviated using Processing-in-Memory (PiM). One category of PiM is Processing-using-Memory (PuM), in which computation takes place inside the memory array by exploiting intrinsic analog properties of the memory device. PuM yields high throughput and e ciency, but supports a limited range of operations. As a result, PuM architectures cannot e ciently perform some complex operations (e.g., multiplication, division, exponentiation) without sizeable increases in chip area and design complexity. To overcome this limitation in DRAM-based PuM architectures, we introduce pLUTo (processing-using-memory with lookup table (LUT) operations), a DRAM-based PuM architecture that leverages the high area density of DRAM to enable the massively parallel storing and querying of lookup tables (LUTs). The use of LUTs enables pLUTo to e ciently execute complex operations in-memory via memory reads (i.e., LUT queries) instead of relying on complex extra logic or performing long sequences of DRAM commands. pLUTo outperforms the optimized CPU and GPU baselines in performance/energy eciency by an average of 1960×/307× and 4.2×/4× across the evaluated workloads, and by 33×/8× and 110×/80× for the LeNet-5 quantized neural network. pLUTo outperforms a state-of-the-art PiM baseline by 50×/342× in performance/energy e ciency.
2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA)
Random number generation is an important task in a wide variety of critical applications includin... more Random number generation is an important task in a wide variety of critical applications including cryptographic algorithms, scientific simulations, and industrial testing tools. True Random Number Generators (TRNGs) produce cryptographicallysecure truly random data by sampling a physical entropy source that typically requires custom hardware and suffers from long latency. To enable high-bandwidth and low-latency TRNGs on widely-available commodity devices, recent works propose hardware TRNGs that generate random numbers using commodity DRAM as an entropy source. Although prior works demonstrate promising TRNG mechanisms using DRAM, practical integration of such mechanisms into real systems poses various challenges. We identify three key challenges for using DRAM-based TRNGs in current systems: (1) generating random numbers with DRAM-based TRNGs can degrade overall system performance by slowing down concurrently-running applications due to the interference between RNG and regular memory operations in the memory controller (i.e., RNG interference), (2) this RNG interference can degrade system fairness by causing unfair prioritization of applications that intensively use random numbers (i.e., RNG applications), and (3) RNG applications can experience significant slowdown due to the high latency of DRAM-based TRNGs. To address these challenges, we propose DR-STRaNGe, an end-to-end system design for DRAM-based TRNGs that (1) reduces the RNG interference by separating RNG requests from regular memory requests in the memory controller, (2) improves fairness across applications with an RNG-aware memory request scheduler, and (3) hides the large TRNG latencies using a random number buffering mechanism combined with a new DRAM idleness predictor that accurately identifies idle DRAM periods. We evaluate DR-STRaNGe using a comprehensive set of 186 multi-programmed workloads. Compared to an RNG-oblivious baseline system, DR-STRaNGe improves the performance of non-RNG and RNG applications on average by 17.9% and 25.1%, respectively. DR-STRaNGe improves system fairness by 32.1% on average when generating random numbers at a 5 Gb/s throughput. DR-STRaNGe reduces energy consumption by 21% compared to the RNG-oblivious baseline design by reducing the time spent for RNG and non-RNG memory accesses by 15.8%.
2020 IEEE 38th International Conference on Computer Design (ICCD), 2020
Resistive memories have limited lifetime caused by limited write endurance and highly non-uniform... more Resistive memories have limited lifetime caused by limited write endurance and highly non-uniform write access patterns. Two main techniques to mitigate endurance-related memory failures are 1) wear-leveling, to evenly distribute the writes across the entire memory, and 2) fault tolerance, to correct memory cell failures. However, one of the main open challenges in extending the lifetime of existing resistive memories is to make both techniques work together seamlessly and e ciently. To address this challenge, we propose WoLFRaM, a new mechanism that combines both wear-leveling and fault tolerance techniques at low cost by using a programmable resistive address decoder (PRAD). The key idea of WoLFRaM is to use PRAD for implementing 1) a new e cient wear-leveling mechanism that remaps write accesses to random physical locations on the y, and 2) a new e cient fault tolerance mechanism that recovers from faults by remapping failed memory blocks to available physical locations. Our evaluations show that, for a Phase Change Memory (PCM) based system with cell endurance of 10 8 writes, WoLFRaM increases the memory lifetime by 68% compared to a baseline that implements the best state-of-the-art wear-leveling and fault correction mechanisms. WoLFRaM's average / worst-case performance and energy overheads are 0.51% / 3.8% and 0.47% / 2.1% respectively.
Low-cost devices are now commonplace and can be found in diverse environments such as home electr... more Low-cost devices are now commonplace and can be found in diverse environments such as home electronics or cars. As such devices are ubiquitous and portable, manufacturers strive to minimize their cost and power consumption by eliminating many features that exist in other computing systems. Unfortunately, this results in a lack of basic security features, even though their ubiquity and ease of physical access make these devices particularly vulnerable to attacks. To address the lack of security mechanisms in low-cost devices that make use of DRAM, we propose Dataplant, a new set of low-cost, high-performance, and reliable security primitives that reside in and make use of commodity DRAM chips. The main idea of Dataplant is to slightly modify the internal DRAM timing signals to expose the inherent process variation found in all DRAM chips for generating unpredictable but reproducible values (e.g., keys, seeds, signatures) within DRAM, without affecting regular DRAM operation. We use D...
Uploads
Papers by Lois Orosa