18.1 Hardware Performance Issues

Introduction

Processor design has increasingly focused on Instruction-Level Parallelism (ILP) to enhance performance through:

- Pipelining: Instructions flow through multiple stages, processing different instructions simultaneously.
- Superscalar Architecture: Multiple pipelines execute instructions in parallel, as long as hazards are managed.
- Simultaneous Multithreading (SMT): Expanded registers allow multiple threads to share pipeline resources, enhancing efficiency.

However, each advance adds complexity. Pipelines have grown longer, requiring more control logic. Superscalar and SMT designs face diminishing returns as managing multiple pipelines and threads becomes harder. Increasing chip complexity also complicates design and fabrication, shifting focus toward simpler memory logic, with power demands further limiting design choices.
Power Consumption

To keep increasing performance, designers have implemented complex processor features (pipelining, superscalar, SMT) and raised clock frequencies, leading to higher power demands. Increasing cache memory helps control power density, as memory transistors consume significantly less power than logic. As a result, memory now occupies up to half of the chip area, although much area still goes to logic.

Efficient use of logic transistors is challenging, as scaling complexity has diminishing returns. Pollack's rule suggests that doubling logic only raises performance by about 40%. Multicore designs offer potential for near-linear performance gains, but only if software can leverage multiple cores. Additionally, multiple threads can better utilize large cache memory, supporting a multicore approach.
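The arithmetic behind this argument can be sketched in a few lines of Python. Pollack's rule is an empirical observation that single-core performance grows roughly with the square root of logic complexity; the function name below is our own, not from the text:

```python
import math

def pollack_performance(complexity_ratio: float) -> float:
    """Relative performance for a given increase in logic complexity,
    per Pollack's rule: performance ~ sqrt(complexity)."""
    return math.sqrt(complexity_ratio)

# Doubling the logic of one large core yields only ~41% more performance...
single_core_gain = pollack_performance(2.0)

# ...while spending the same transistor budget on two unchanged cores can
# approach 2x, provided software keeps both cores busy.
two_core_ideal = 2 * pollack_performance(1.0)

print(f"Doubled logic, one core:      {single_core_gain:.2f}x")  # ~1.41x
print(f"Two unchanged cores (ideal):  {two_core_ideal:.2f}x")    # 2.00x
```

This gap between sqrt-scaling of a single complex core and near-linear scaling of multiple simple cores is the quantitative case for the multicore approach.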

18.2 Software Performance Issues

Applications on Multicore Platforms


The performance benefits of multicore systems depend on effectively
leveraging parallel resources. According to Amdahl’s Law, even a small
fraction of serial code can limit overall speedup.

For instance, with 10% serial code (f = 0.9), a program running on eight
cores would achieve only a 4.7x speedup. Additionally, parallel processing
introduces overhead from communication, task distribution, and cache
coherence, which can cause performance to peak and then degrade as
the number of cores increases.
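The 4.7x figure follows directly from Amdahl's Law; a minimal sketch (function name ours) that reproduces it:

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Amdahl's Law: speedup = 1 / ((1 - f) + f / N),
    where f is the parallelizable fraction and N the core count."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)

# 10% serial code (f = 0.9) on eight cores:
print(f"{amdahl_speedup(0.9, 8):.1f}x")   # 4.7x

# Even with an enormous core count, the serial 10% caps speedup near 10x:
print(f"{amdahl_speedup(0.9, 10**9):.1f}x")  # 10.0x
```

Note that this model ignores the communication, task-distribution, and cache-coherence overheads mentioned above, which make real speedups lower still.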
However, many applications can still take advantage of multicore systems. Database management systems are one example where careful attention to reducing serial portions of code allows for efficient multicore use. Servers also benefit from multicore organization, as they handle numerous independent transactions simultaneously.
In addition to general-purpose server software, several application types benefit from scaling throughput with additional cores, including:

- Multithreaded native applications (thread-level parallelism): These applications consist of a few highly threaded processes.
- Multiprocess applications (process-level parallelism): These feature many single-threaded processes.
- Java applications: Java inherently supports threading, and the Java Virtual Machine (JVM) is designed for multithreading, providing efficient scheduling and memory management.
- Multi-instance applications (application-level parallelism): Applications that don't scale well with threads can still benefit by running multiple instances in parallel. Virtualization can provide each instance with its own secure environment if needed.

A key concept in thread-level parallelism is threading granularity, which refers to the smallest unit of work that can be effectively parallelized. Finer granularity allows greater flexibility for programmers but increases overhead due to more frequent context switching and management tasks. This creates a tradeoff: finer-grained threading can lead to more efficient parallelization, but at the cost of increased overhead.
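The granularity tradeoff can be made concrete with a small, hypothetical Python sketch: the same 1,000 work items are submitted either as many tiny tasks (fine-grained) or as a few large chunks (coarse-grained). The names `process` and `run` are illustrative, not from the text:

```python
from concurrent.futures import ThreadPoolExecutor

def process(item: int) -> int:
    return item * item  # stand-in for a real unit of work

def run(items: list, chunk_size: int) -> list:
    # chunk_size is the granularity knob: 1 = finest, len(items) = coarsest.
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        # One scheduled task per chunk; smaller chunks mean more tasks
        # and therefore more scheduling/management overhead.
        results = pool.map(lambda chunk: [process(x) for x in chunk], chunks)
    return [r for chunk in results for r in chunk]

items = list(range(1000))
fine = run(items, chunk_size=1)      # 1000 tasks: best load balance, most overhead
coarse = run(items, chunk_size=250)  # 4 tasks: least overhead, coarsest balance
assert fine == coarse == [x * x for x in items]
```

Both calls produce identical results; only the scheduling overhead and load-balancing flexibility differ, which is exactly the tradeoff described above.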

Application Example: Valve Game Software

Valve, known for popular games and the Source engine, has enhanced the
Source engine to leverage multithreading for better scalability on
multicore processors from Intel and AMD. This upgrade improves the
performance of games like Half-Life 2.

Threading Granularity Options

Valve defines three threading granularity strategies:

- Coarse-grained threading: Individual modules (e.g., rendering, AI, physics) are assigned to separate processors. Each module runs as a single thread, coordinated by a timeline thread. This approach can achieve up to twice the performance on two processors under ideal conditions, though real-world gameplay typically sees about a 1.2x improvement.

- Fine-grained threading: Similar tasks are distributed across multiple processors. For example, an array iteration can be divided into smaller parallel loops. However, this approach is complex due to variable work unit times and intricate outcome management.
- Hybrid threading: This combines fine-grained threading for some systems while keeping others single-threaded. Valve found this approach most effective for future multicore systems with eight or sixteen processors.

Valve identified certain systems, like sound mixing, that function well on a
single processor without the need for interaction or timing constraints. In
contrast, scene rendering can be divided into multiple threads, benefiting
from parallel processing.
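The fine-grained strategy above (splitting one array iteration into smaller parallel loops) can be sketched in Python; this is an illustrative example, not Valve's code, and the function names are our own:

```python
from concurrent.futures import ThreadPoolExecutor
import math

def transform(values: list, workers: int = 4) -> list:
    """Split one loop over `values` into `workers` smaller parallel loops."""
    n = len(values)
    # Partition the index range into contiguous slices, one per worker.
    bounds = [(i * n // workers, (i + 1) * n // workers) for i in range(workers)]

    def worker(lo: int, hi: int) -> list:
        # Each thread executes its own slice of the original iteration.
        return [math.sqrt(v) for v in values[lo:hi]]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda b: worker(*b), bounds)
    # "Outcome management": stitch the partial results back in order.
    return [x for part in parts for x in part]

print(transform([1.0, 4.0, 9.0, 16.0]))  # [1.0, 2.0, 3.0, 4.0]
```

The stitching step at the end hints at why the text calls outcome management intricate: with variable work-unit times, real engines must merge results that finish out of order.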

Rendering Module Strategy

The rendering module’s threading strategy includes:

- Constructing scene-rendering lists for multiple scenes simultaneously.

- Overlapping graphics simulations to optimize rendering.

- Computing character bone transformations for all characters in parallel.

- Allowing multiple threads to draw concurrently.

This hierarchical threading structure enables the rendering module to efficiently manage the visual elements of the game, ensuring a smoother gameplay experience.
Designers found that locking key databases, such as the world list, was
inefficient, as threads spent over 95% of their time reading data and only
about 5% writing. They adopted a single-writer-multiple-readers model to
optimize performance, allowing multiple threads to read simultaneously
while restricting write access to one thread at a time.
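A minimal single-writer-multiple-readers lock, sketching the model described above (an illustrative implementation in Python, not Valve's actual code):

```python
import threading

class ReadWriteLock:
    """Many concurrent readers, but at most one writer and no readers
    while a write is in progress."""

    def __init__(self):
        self._readers = 0
        self._count_lock = threading.Lock()    # guards the reader count
        self._writer_lock = threading.Lock()   # held while readers or a writer are active

    def acquire_read(self):
        with self._count_lock:
            self._readers += 1
            if self._readers == 1:
                self._writer_lock.acquire()    # first reader blocks writers

    def release_read(self):
        with self._count_lock:
            self._readers -= 1
            if self._readers == 0:
                self._writer_lock.release()    # last reader lets writers in

    def acquire_write(self):
        self._writer_lock.acquire()            # exclusive: one writer, no readers

    def release_write(self):
        self._writer_lock.release()

world_list = {"entities": []}
rw = ReadWriteLock()

# Writer path (rare, ~5% of accesses): exclusive access.
rw.acquire_write()
world_list["entities"].append("player")
rw.release_write()

# Reader path (common, ~95% of accesses): many threads may hold this at once.
rw.acquire_read()
snapshot = list(world_list["entities"])
rw.release_read()
print(snapshot)  # ['player']
```

Because reads vastly outnumber writes in the workload described, letting readers proceed concurrently removes most of the contention that a plain mutex on the world list would cause.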

18.3 MULTICORE ORGANIZATION

Levels of Cache Memory


Figure 18.6 illustrates four general organizations for multicore systems:

1. Figure 18.6a: Early multicore designs, like the ARM11 MPCore, feature individual L1 caches for each core, while L2 and higher-level caches are unified.

2. Figure 18.6b: Similar to the first, this organization includes L2 cache without on-chip cache sharing, exemplified by the AMD Opteron.

3. Figure 18.6c: This structure utilizes a shared L2 cache, as seen in the Intel Core Duo, allowing cores to benefit from shared data.

4. Figure 18.6d: With increasing cache memory, systems like the Intel Core i7 employ dedicated L1 and L2 caches alongside a shared L3 cache.

The shared higher-level cache offers several advantages:

1. Reduced Miss Rates: If one core accesses data, it's cached for others, improving access speed.

2. No Data Replication: Shared data isn't duplicated, saving cache space.

3. Dynamic Allocation: Cache can be allocated based on thread locality, allowing flexibility for working sets.

4. Simplified Inter-Core Communication: Shared memory locations facilitate communication between cores.

5. Focused Cache Coherency: Confines the coherency problem to lower cache levels, enhancing performance.

Dedicated L2 caches provide rapid access for cores, benefiting threads with strong locality. However, as memory and core counts increase, a combination of shared L3 with dedicated L2 caches is expected to yield better performance. Future architectures may feature L1 caches per core, L2 caches shared between 2 to 4 cores, and a global L3 cache across all cores.

SIMULTANEOUS MULTITHREADING

Another critical design consideration for multicore systems is whether to implement simultaneous multithreading (SMT) in individual cores. For instance, the Intel Core Duo features pure superscalar cores, while the Intel Core i7 incorporates SMT cores.

SMT allows a multicore system to scale the number of hardware-level threads it can support. Therefore, a system with four cores, each capable of handling four simultaneous threads, functions similarly at the application level to a system with 16 separate cores. As software evolves to take better advantage of parallel resources, adopting an SMT approach becomes increasingly appealing compared to a strictly superscalar design.

Multiple-Choice Questions

1. What is the main benefit of multicore systems according to Amdahl's law?

- A) Increased serial execution time

- B) Improved performance by exploiting parallelism

- C) Reduced memory requirements

- D) Decreased overhead in task management

Answer: B) Improved performance by exploiting parallelism


2. In the context of multicore systems, what does the term 'threading
granularity' refer to?

- A) The maximum number of threads per core

- B) The size of individual threads in memory

- C) The minimal unit of work that can be parallelized

- D) The total number of cores in a system

Answer: C) The minimal unit of work that can be parallelized

3. Which threading approach was found to provide the best scalability for
Valve's Source engine?

- A) Coarse-grained threading

- B) Fine-grained threading

- C) Hybrid threading

- D) Single-threaded execution

Answer: C) Hybrid threading

4. What advantage does a shared L3 cache provide in a multicore architecture?

- A) Increased data replication

- B) Faster access times for individual cores

- C) Reduced overall cache miss rates

- D) More complex cache coherence issues

Answer: C) Reduced overall cache miss rates

5. What is one characteristic of multithreaded applications?

- A) They contain only a single-threaded process.

- B) They require no coordination between threads.

- C) They have a small number of highly threaded processes.

- D) They are always faster than single-threaded applications.

Answer: C) They have a small number of highly threaded processes.


Essay Questions

1. Discuss the implications of Amdahl's Law on the design of software for multicore processors. How can software engineers optimize applications to take full advantage of multicore architectures?

2. Compare and contrast the benefits and drawbacks of using simultaneous multithreading (SMT) versus pure superscalar architectures in multicore systems. How does this choice impact overall system performance and application development?
