18.1 Hardware Performance Issues

Introduction

Processor design has increasingly focused on Instruction-Level Parallelism (ILP) to enhance performance through:

- Pipelining: Instructions flow through multiple stages, processing different instructions simultaneously.
- Superscalar Architecture: Multiple pipelines execute instructions in parallel, as long as hazards are managed.
- Simultaneous Multithreading (SMT): Expanded registers allow multiple threads to share pipeline resources, enhancing efficiency.

However, each advance adds complexity. Pipelines have grown longer, requiring more control logic. Superscalar and SMT designs face diminishing returns as managing multiple pipelines and threads becomes harder. Increasing chip complexity also complicates design and fabrication, shifting focus toward simpler memory logic, with power demands further limiting design choices.
Power Consumption

To keep increasing performance, designers have implemented complex processor features (pipelining, superscalar, SMT) and raised clock frequencies, leading to higher power demands. Increasing cache memory helps control power density, as memory transistors consume significantly less power than logic. As a result, memory now occupies up to half of the chip area, although much area still goes to logic.

Efficient use of logic transistors is challenging, as scaling complexity has diminishing returns. Pollack's rule suggests that doubling logic only raises performance by about 40%. Multicore designs offer potential for near-linear performance gains, but only if software can leverage multiple cores. Additionally, multiple threads can better utilize large cache memory, supporting a multicore approach.
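The arithmetic behind this argument can be sketched in a few lines of Python. Pollack's rule is an empirical observation that single-core performance grows roughly with the square root of logic complexity; the function name below is our own, not from the text:

```python
import math

def pollack_performance(complexity_ratio: float) -> float:
    """Relative performance for a given increase in logic complexity,
    per Pollack's rule: performance ~ sqrt(complexity)."""
    return math.sqrt(complexity_ratio)

# Doubling the logic of one large core yields only ~41% more performance...
single_core_gain = pollack_performance(2.0)

# ...while spending the same transistor budget on two unchanged cores can
# approach 2x, provided software keeps both cores busy.
two_core_ideal = 2 * pollack_performance(1.0)

print(f"Doubled logic, one core:      {single_core_gain:.2f}x")  # ~1.41x
print(f"Two unchanged cores (ideal):  {two_core_ideal:.2f}x")    # 2.00x
```

This gap between sqrt-scaling of a single complex core and near-linear scaling of multiple simple cores is the quantitative case for the multicore approach.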

18.2 Software Performance Issues

Applications on Multicore Platforms


The performance benefits of multicore systems depend on effectively
leveraging parallel resources. According to Amdahl’s Law, even a small
fraction of serial code can limit overall speedup.

For instance, with 10% serial code (f = 0.9), a program running on eight
cores would achieve only a 4.7x speedup. Additionally, parallel processing
introduces overhead from communication, task distribution, and cache
coherence, which can cause performance to peak and then degrade as
the number of cores increases.
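The 4.7x figure follows directly from Amdahl's Law; a minimal sketch (function name ours) that reproduces it:

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Amdahl's Law: speedup = 1 / ((1 - f) + f / N),
    where f is the parallelizable fraction and N the core count."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)

# 10% serial code (f = 0.9) on eight cores:
print(f"{amdahl_speedup(0.9, 8):.1f}x")   # 4.7x

# Even with an enormous core count, the serial 10% caps speedup near 10x:
print(f"{amdahl_speedup(0.9, 10**9):.1f}x")  # 10.0x
```

Note that this model ignores the communication, task-distribution, and cache-coherence overheads mentioned above, which make real speedups lower still.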
However, many applications can still take advantage of multicore systems. Database management systems are one example where careful attention to reducing serial portions of code allows for efficient multicore use. Servers also benefit from multicore organization, as they handle numerous independent transactions simultaneously.
In addition to general-purpose server software, several application types benefit from scaling throughput with additional cores, including:

- Multithreaded native applications (thread-level parallelism): These applications consist of a few highly threaded processes.
- Multiprocess applications (process-level parallelism): These feature many single-threaded processes.
- Java applications: Java inherently supports threading, and the Java Virtual Machine (JVM) is designed for multithreading, providing efficient scheduling and memory management.
- Multi-instance applications (application-level parallelism): Applications that don't scale well with threads can still benefit by running multiple instances in parallel. Virtualization can provide each instance with its own secure environment if needed.

A key concept in thread-level parallelism is threading granularity, which refers to the smallest unit of work that can be effectively parallelized. Finer granularity allows greater flexibility for programmers but increases overhead due to more frequent context switching and management tasks. This creates a tradeoff: finer-grained threading can lead to more efficient parallelization, but at the cost of increased overhead.
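The granularity tradeoff can be made concrete with a small, hypothetical Python sketch: the same 1,000 work items are submitted either as many tiny tasks (fine-grained) or as a few large chunks (coarse-grained). The names `process` and `run` are illustrative, not from the text:

```python
from concurrent.futures import ThreadPoolExecutor

def process(item: int) -> int:
    return item * item  # stand-in for a real unit of work

def run(items: list, chunk_size: int) -> list:
    # chunk_size is the granularity knob: 1 = finest, len(items) = coarsest.
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        # One scheduled task per chunk; smaller chunks mean more tasks
        # and therefore more scheduling/management overhead.
        results = pool.map(lambda chunk: [process(x) for x in chunk], chunks)
    return [r for chunk in results for r in chunk]

items = list(range(1000))
fine = run(items, chunk_size=1)      # 1000 tasks: best load balance, most overhead
coarse = run(items, chunk_size=250)  # 4 tasks: least overhead, coarsest balance
assert fine == coarse == [x * x for x in items]
```

Both calls produce identical results; only the scheduling overhead and load-balancing flexibility differ, which is exactly the tradeoff described above.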

Application Example: Valve Game Software

Valve, known for popular games and the Source engine, has enhanced the
Source engine to leverage multithreading for better scalability on
multicore processors from Intel and AMD. This upgrade improves the
performance of games like Half-Life 2.

Threading Granularity Options

Valve defines three threading granularity strategies:

- Coarse-grained threading: Individual modules (e.g., rendering, AI, physics) are assigned to separate processors. Each module runs as a single thread, coordinated by a timeline thread. This approach can achieve up to twice the performance on two processors under ideal conditions, though real-world gameplay typically sees about a 1.2x improvement.

- Fine-grained threading: Similar tasks are distributed across multiple processors. For example, an array iteration can be divided into smaller parallel loops. However, this approach is complex due to variable work unit times and intricate outcome management.
- Hybrid threading: This combines fine-grained threading for some systems while keeping others single-threaded. Valve found this approach most effective for future multicore systems with eight or sixteen processors.

Valve identified certain systems, like sound mixing, that function well on a
single processor without the need for interaction or timing constraints. In
contrast, scene rendering can be divided into multiple threads, benefiting
from parallel processing.
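The fine-grained strategy above (splitting one array iteration into smaller parallel loops) can be sketched in Python; this is an illustrative example, not Valve's code, and the function names are our own:

```python
from concurrent.futures import ThreadPoolExecutor
import math

def transform(values: list, workers: int = 4) -> list:
    """Split one loop over `values` into `workers` smaller parallel loops."""
    n = len(values)
    # Partition the index range into contiguous slices, one per worker.
    bounds = [(i * n // workers, (i + 1) * n // workers) for i in range(workers)]

    def worker(lo: int, hi: int) -> list:
        # Each thread executes its own slice of the original iteration.
        return [math.sqrt(v) for v in values[lo:hi]]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda b: worker(*b), bounds)
    # "Outcome management": stitch the partial results back in order.
    return [x for part in parts for x in part]

print(transform([1.0, 4.0, 9.0, 16.0]))  # [1.0, 2.0, 3.0, 4.0]
```

The stitching step at the end hints at why the text calls outcome management intricate: with variable work-unit times, real engines must merge results that finish out of order.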

Rendering Module Strategy

The rendering module’s threading strategy includes:

- Constructing scene-rendering lists for multiple scenes simultaneously.

- Overlapping graphics simulations to optimize rendering.

- Computing character bone transformations for all characters in parallel.

- Allowing multiple threads to draw concurrently.

This hierarchical threading structure enables the rendering module to efficiently manage the visual elements of the game, ensuring a smoother gameplay experience.
Designers found that locking key databases, such as the world list, was
inefficient, as threads spent over 95% of their time reading data and only
about 5% writing. They adopted a single-writer-multiple-readers model to
optimize performance, allowing multiple threads to read simultaneously
while restricting write access to one thread at a time.
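A minimal single-writer-multiple-readers lock, sketching the model described above (an illustrative implementation in Python, not Valve's actual code):

```python
import threading

class ReadWriteLock:
    """Many concurrent readers, but at most one writer and no readers
    while a write is in progress."""

    def __init__(self):
        self._readers = 0
        self._count_lock = threading.Lock()    # guards the reader count
        self._writer_lock = threading.Lock()   # held while readers or a writer are active

    def acquire_read(self):
        with self._count_lock:
            self._readers += 1
            if self._readers == 1:
                self._writer_lock.acquire()    # first reader blocks writers

    def release_read(self):
        with self._count_lock:
            self._readers -= 1
            if self._readers == 0:
                self._writer_lock.release()    # last reader lets writers in

    def acquire_write(self):
        self._writer_lock.acquire()            # exclusive: one writer, no readers

    def release_write(self):
        self._writer_lock.release()

world_list = {"entities": []}
rw = ReadWriteLock()

# Writer path (rare, ~5% of accesses): exclusive access.
rw.acquire_write()
world_list["entities"].append("player")
rw.release_write()

# Reader path (common, ~95% of accesses): many threads may hold this at once.
rw.acquire_read()
snapshot = list(world_list["entities"])
rw.release_read()
print(snapshot)  # ['player']
```

Because reads vastly outnumber writes in the workload described, letting readers proceed concurrently removes most of the contention that a plain mutex on the world list would cause.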

18.3 MULTICORE ORGANIZATION

Levels of Cache Memory


Figure 18.6 illustrates four general organizations for multicore systems:

1. Figure 18.6a: Early multicore designs, like the ARM11 MPCore, feature individual L1 caches for each core, while L2 and higher-level caches are unified.

2. Figure 18.6b: Similar to the first, this organization includes L2 cache without on-chip cache sharing, exemplified by the AMD Opteron.

3. Figure 18.6c: This structure utilizes a shared L2 cache, as seen in the Intel Core Duo, allowing cores to benefit from shared data.

4. Figure 18.6d: With increasing cache memory, systems like the Intel Core i7 employ dedicated L1 and L2 caches alongside a shared L3 cache.

The shared higher-level cache offers several advantages:

1. Reduced Miss Rates: If one core accesses data, it's cached for others, improving access speed.

2. No Data Replication: Shared data isn't duplicated, saving cache space.

3. Dynamic Allocation: Cache can be allocated based on thread locality, allowing flexibility for working sets.

4. Simplified Inter-Core Communication: Shared memory locations facilitate communication between cores.

5. Focused Cache Coherency: Confines the coherency problem to lower cache levels, enhancing performance.

Dedicated L2 caches provide rapid access for cores, benefiting threads with strong locality. However, as memory and core counts increase, a combination of shared L3 with dedicated L2 caches is expected to yield better performance. Future architectures may feature L1 caches per core, L2 caches shared between 2 to 4 cores, and a global L3 cache across all cores.

SIMULTANEOUS MULTITHREADING

Another critical design consideration for multicore systems is whether to implement simultaneous multithreading (SMT) in individual cores. For instance, the Intel Core Duo features pure superscalar cores, while the Intel Core i7 incorporates SMT cores.

SMT allows a multicore system to scale the number of hardware-level threads it can support. Therefore, a system with four cores, each capable of handling four simultaneous threads, functions similarly at the application level to a system with 16 separate cores. As software evolves to take better advantage of parallel resources, adopting an SMT approach becomes increasingly appealing compared to a strictly superscalar design.

Multiple-Choice Questions

1. What is the main benefit of multicore systems according to Amdahl's law?

- A) Increased serial execution time

- B) Improved performance by exploiting parallelism

- C) Reduced memory requirements

- D) Decreased overhead in task management

Answer: B) Improved performance by exploiting parallelism


2. In the context of multicore systems, what does the term 'threading
granularity' refer to?

- A) The maximum number of threads per core

- B) The size of individual threads in memory

- C) The minimal unit of work that can be parallelized

- D) The total number of cores in a system

Answer: C) The minimal unit of work that can be parallelized

3. Which threading approach was found to provide the best scalability for
Valve's Source engine?

- A) Coarse-grained threading

- B) Fine-grained threading

- C) Hybrid threading

- D) Single-threaded execution

Answer: C) Hybrid threading

4. What advantage does a shared L3 cache provide in a multicore architecture?

- A) Increased data replication

- B) Faster access times for individual cores

- C) Reduced overall cache miss rates

- D) More complex cache coherence issues

Answer: C) Reduced overall cache miss rates

5. What is one characteristic of multithreaded applications?

- A) They contain only a single-threaded process.

- B) They require no coordination between threads.

- C) They have a small number of highly threaded processes.

- D) They are always faster than single-threaded applications.

Answer: C) They have a small number of highly threaded processes.


Essay Questions

1. Discuss the implications of Amdahl's Law on the design of software for multicore processors. How can software engineers optimize applications to take full advantage of multicore architectures?

2. Compare and contrast the benefits and drawbacks of using simultaneous multithreading (SMT) versus pure superscalar architectures in multicore systems. How does this choice impact overall system performance and application development?
