Speculative Execution Research Papers

Hardware multithreading is becoming a generally applied technique in the next generation of microprocessors. Several multithreaded processors have been announced by industry or are already in production in the areas of high-performance... more

The Cydra 5 is a VLIW minisupercomputer with hardware designed to accelerate a broad class of inner loops, presenting unique challenges to its compilers. We discuss the organization of its Fortran/77 compiler and several of the key... more

Improving architectural energy efficiency is important to address diminishing energy efficiency gains from technology scaling. At the same time, limiting hardware complexity is also important. This paper presents a new processor... more

Over the last 20 years, the open-source community has provided more and more software on which the world's high-performance computing systems depend for performance and productivity. The community has invested millions of dollars and years... more

an NSF Graduate Research Fellowship and NSF and DARPA grants to the Fugu and Raw projects. While provided a vital support network. Most of all, I have relied on my wife, Kathleen Shannon, and my children, Karissa and Anya. Their love has... more

The emergence and wide adoption of web applications have moved the client-side component, often written in JavaScript, to the forefront of computing on the web. Web application developers try to move more computation to the client side to... more

This paper proposes a new hardware technique for using one core of a CMP to prefetch data for a thread running on another core. Our approach simply executes a copy of all non-control instructions in the prefetching core after they have... more

The AMD-K6 MMX-enabled processor is plug-compatible with the industry-standard Socket 7 and is binary compatible with the existing base of legacy x86 software. The microarchitecture is based on an out-of-order, superscalar execution engine... more

Current microprocessors exploit instruction-level parallelism through a deep processor pipeline and superscalar instruction issue. VLSI technology offers several solutions for aggressive exploitation of instruction-level parallelism in future generations of microprocessors. Technological advances will replace gate delay with on-chip wire delay as the main obstacle to increasing chip complexity and cycle rate. The implication for the microarchitecture is that functionally partitioned designs with strict nearest-neighbour connections must be developed. Among the major problems facing microprocessor designers is the application of an even higher degree of speculation in combination with functional partitioning of the processor, which prepares the way for exceeding the classical dataflow limit imposed by data dependences. In this paper we survey the current approaches to solving this problem; in particular, we analyse several new research directions whose solutions are based on the complex uniprocessor architecture. A complex uniprocessor chip features a very aggressive superscalar design combined with a trace cache and superspeculative techniques. Superspeculative techniques exceed the classical dataflow limit, under which even a machine with unlimited resources cannot execute a program any faster than the longest dependence chain introduced by the program's data dependences. Superspeculative processors also speculate about control dependences. The trace cache stores dynamic instruction traces contiguously, and instructions are fetched from the trace cache rather than from the instruction cache. Since a dynamic trace of instructions may contain multiple taken branches, there is no need to fetch from multiple targets, as would be necessary when predicting multiple branches and fetching 16 or 32 instructions from the instruction cache.
Multiscalar and trace processors define several processing cores that speculatively execute different parts of a sequential program in parallel. Multiscalar processors use a compiler to partition the program into segments, whereas a trace processor uses a trace cache to generate trace segments dynamically for the processing cores. A datascalar processor runs the same sequential program redundantly on several processing elements, each of which holds a different data set. This paper discusses and compares the performance potential of these complex uniprocessors.
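
The classical dataflow limit mentioned in this abstract can be illustrated with a small sketch. It is not taken from the paper: the dependence map `program`, the function name `dataflow_limit`, and the default one-cycle latency are all assumptions made for illustration. The sketch computes the length of the longest chain in an acyclic data-dependence graph, which is the minimum execution time on an idealised machine with unlimited resources and no value speculation.

```python
from functools import lru_cache


def dataflow_limit(deps, latency=None):
    """Return the length, in cycles, of the longest data-dependence chain.

    deps    maps each instruction name to the names it depends on.
    latency maps an instruction name to its cycle count; any instruction
            not listed is assumed to take one cycle (an illustrative choice).
    The dependence graph is assumed to be acyclic.
    """
    latency = latency or {}

    @lru_cache(maxsize=None)
    def finish(instr):
        # An instruction can start only after all of its producers finish.
        start = max((finish(d) for d in deps.get(instr, ())), default=0)
        return start + latency.get(instr, 1)

    return max((finish(i) for i in deps), default=0)


# Hypothetical five-instruction program: i4 needs i2 and i3, which need i1.
program = {
    "i1": [],
    "i2": ["i1"],
    "i3": ["i1"],
    "i4": ["i2", "i3"],
    "i5": [],
}

limit = dataflow_limit(program)  # longest chain is i1 -> i2 (or i3) -> i4
```

Under these assumptions no amount of issue width shortens the chain i1, i2, i4; only breaking a dependence, for example by predicting the value that i1 produces, would let the program finish sooner, which is the sense in which superspeculative techniques exceed the limit.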

The paper presents an approach helping developers to maintain source code identifiers and comments consistent with high-level artifacts. Specifically the approach computes and shows the textual similarity between source code and related... more

The branch predictor (BP) is an essential component in modern processors, since high BP accuracy can improve performance and reduce energy by decreasing the number of instructions executed on the wrong path. However, reducing latency and storage... more

- by Sparsh Mittal
- 6
  Computer Architecture, Computer Engineering, Literature Review, Survey

Early web content was expressed statically, making it amenable to straightforward prefetching to reduce user-perceived network delay. In contrast, today's rich web applications often hide content behind JavaScript event handlers,... more

- by Jeremy Elson
- 3
  Speculative Execution, Low Latency, Web Browsing

To narrow the widening gap between processor and memory performance, the authors propose improving the cache locality of pointer-manipulating programs and bolstering performance by careful placement of structure elements.

- by Mark Hill
- 19
  Scheduling, Operating System, Programming, Data Structure

Designers face many choices when planning a new high-performance, general purpose microprocessor. Options include superscalar organization (the ability to dispatch and execute more than one instruction at a time), out-of-order issue of... more

The Multiflow compiler uses the trace scheduling algorithm to find and exploit instruction-level parallelism beyond basic blocks. The compiler generates code for VLIW computers that issue up to 28 operations each cycle and maintain more... more

- by Woody Lichtenstein
- 12
  Distributed Computing, Scheduling, Compiler, Performance Analysis

Improving MapReduce Performance in Heterogeneous Environments. ... If a node crashes, MapReduce re-runs its tasks on a different machine. ...

- by Richa Jain
- 20
  Harmonic Analysis, Data Mining, Scheduling, Resource Allocation

Performance of multithreaded programs is heavily influenced by the latencies of the thread management and synchronization operations. Improving these latencies becomes especially important when the parallelization is performed at fine... more

In modern superscalar microarchitectures that speculatively execute a great quantity of code, it won't be possible to aggressively exploit the instruction-level parallelism of programs without performing branch prediction. Both the... more

As the difference in speed between processor and memory system continues to increase, it is becoming crucial to develop and refine techniques that enhance the effectiveness of cache hierarchies. Two such techniques are data prefetching and... more

- by Josep Torrellas
- 2
  Shared memory, Speculative Execution

In modern superscalar microarchitectures that speculatively execute a great quantity of code, it won't be possible to aggressively exploit a program's instruction-level parallelism without performing branch prediction. Both the... more

- by C. Radu, Adrian Florea, Horia Calborean, A. Gellert
- 6
  Computer Architecture, Information Technology, Speculative Execution, Interactive graphics

There have been a number of successes in the past few years in the use of formal methods for verification of real-time systems, and also in source-to-source transformation of these systems for improved analysis, performance, and... more

- by Thierry Gautier
- 20
  Engineering, Programming Languages, Compilers, Control system

Two-level predictors deliver highly accurate conditional branch prediction, indirect branch target prediction and value prediction. Accurate prediction enables speculative execution of instructions, a technique that increases instruction... more

- by Karel Driesen
- 13
  Combinatorial Optimization, Problem Solving, Case Study, Low Power
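
As a rough illustration of the two-level idea for conditional branches, the sketch below uses one global history register to index a table of 2-bit saturating counters. This is a generic textbook-style arrangement, not necessarily any configuration studied in the paper above; the class name `TwoLevelPredictor`, the helper `accuracy`, the 4-bit history length, and the synthetic branch pattern are all assumptions made for illustration.

```python
class TwoLevelPredictor:
    """Global-history two-level predictor (illustrative sketch).

    Level 1: a shift register holding the last `history_bits` outcomes.
    Level 2: a table of 2-bit saturating counters indexed by that history.
    """

    def __init__(self, history_bits=4):
        self.history_bits = history_bits
        self.history = 0
        # Counters range over 0..3; start at 1 (weakly not-taken).
        self.counters = [1] * (1 << history_bits)

    def predict(self):
        # Predict taken when the selected counter is in the upper half.
        return self.counters[self.history] >= 2

    def update(self, taken):
        # Train the counter selected by the current history...
        c = self.counters[self.history]
        self.counters[self.history] = min(3, c + 1) if taken else max(0, c - 1)
        # ...then shift the actual outcome into the history register.
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def accuracy(predictor, outcomes):
    """Fraction of outcomes predicted correctly, training after each one."""
    correct = 0
    for taken in outcomes:
        correct += predictor.predict() == taken
        predictor.update(taken)
    return correct / len(outcomes)


# Synthetic branch that repeats taken, taken, not-taken (e.g. a short loop).
pattern = [True, True, False] * 200
acc = accuracy(TwoLevelPredictor(history_bits=4), pattern)
```

Because every 4-outcome history in this repeating pattern is followed by a single fixed outcome, the counters settle after a short warm-up and the predictor then follows the pattern, which a single 2-bit counter without history could not do.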

Speculative Multithreading (SpMT) increases performance by executing multiple threads speculatively to exploit thread-level parallelism. By combining software and hardware approaches, we have improved the capabilities of... more

This paper presents Threaded Multi-Path Execution (TME), which exploits existing hardware on a Simultaneous Multithreading (SMT) processor to speculatively execute multiple paths of execution. When there are fewer threads in an SMT... more

- by Dean Tullsen
- 2
  Speculative Execution, Synchronisation

Compiler optimizations are often driven by specific assumptions about the underlying architecture and implementation of the target machine. For example, when targeting shared-memory multiprocessors, parallel programs are compiled to... more

Speculative locking (SL) protocols have been proposed in the literature for improving the performance of read-only transactions (ROTs) without correctness and data currency issues. In these protocols, ROTs carry out speculative executions... more

Irregular algorithms are organized around pointer-based data structures such as graphs and trees, and they are ubiquitous in applications. Recent work by the Galois project has provided a systematic approach for parallelizing irregular... more

- by Martin Burtscher
- 2
  Data Structure, Speculative Execution

To improve the utilization of machine resources in superscalar processors, the instructions have to be carefully scheduled by the compiler. As internal parallelism and pipelining increase, it becomes evident that scheduling should be... more

... Among others, this support was provided by Rachel Allen, Scott Blomquist, Michael Chan, Cornelia Colyer, Mary Ann Ladd, Anne McCarthy, Marilyn Pierce, Lila Rhoades, Ty Sealy ... Most of all, I have relied on my wife, Kathleen... more

Recent research in thread-level speculation (TLS) has proposed several mechanisms for optimistic execution of difficult-to-analyze serial codes in parallel. Though it has been shown that TLS helps to achieve higher levels of parallelism,... more

Replicated state machines are an important and widely-studied methodology for tolerating a wide range of faults. Unfortunately, while replicas should be distributed geographically for maximum fault tolerance, current replicated state... more

- by Benjamin Wester
- 3
  Operating Systems, Fault Tolerance, Speculative Execution

Irregular algorithms are organized around pointer-based data structures such as graphs and trees, and they are ubiquitous in applications. Recent work by the Galois project has provided a systematic approach for parallelizing irregular... more

- by Malik Hassaan
- 19
  Languages, Computer Science, Algorithms, Distributed Computing

WaveScalar is the first dataflow architecture that can efficiently provide the sequential memory semantics required by imperative languages. This work presents an alternative memory ordering mechanism for this architecture, the... more

- by Vitor Costa
- 8
  Transactional Memory, Cluster Computing, Dataflow, Wavescalar

This paper presents new achievements on the automatic mapping of abstract algorithms, written in imperative software programming languages, to custom computing machines. The reconfigurable hardware element of the target architecture... more

Instruction-level parallelism in a single stream of code for non-numerical applications has been the subject of much recent research. This work extends the analysis to symbolic applications described with logic programming. In... more

- by Alessandra Costa
- 11
  Logic Programming, Performance, Scheduling, Parallel Processing

Pen-Chung Yew and Roy Dz-Ching Ju, Tin-Fook Ngai, Sun Chan. Speculative execution, such as control speculation or data speculation, is an effective way to improve... more

Speculative execution, such as control speculation and data speculation, is an effective way to improve program performance. Using edge/path profile information or simple heuristic rules, existing compiler frameworks can adequately... more

The contribution of memory latency to execution time continues to increase, and latency hiding mechanisms become ever more important for efficient processor design. While high-end processors can use elaborate techniques like multiple... more

- by Aviral Shrivastava
- 11
  Embedded Systems, Hardware, Microarchitecture, Process Design

Predicated execution is an effective technique for dealing with conditional branches in application programs. However, there are several problems associated with conventional compiler support for predicated execution. First, all paths of... more

- by David Lin
- 11
  Automatic Control, Parallel Processing, Frequency, Perfect

Cloud computing systems use distributed file systems (DFSs) to store and process the large volumes of data generated in organizations. Users of web-based information systems perform read operations very frequently and infrequently carry out... more

A long-running transaction is an interactive component of a distributed system which must be executed as if it were a single atomic action. In principle, it should not be interrupted or fail in the middle, and it must not be interleaved... more
