SMM Chapter 1 (Multi-Microprocessor Systems)
Duration: 14 weeks x 3 hours
Grading: NF = 45% NE + 45% NL + 10% PC, where NF = final grade, NE = exam grade, NL = laboratory grade, PC = course attendance. (For example, NE = 8, NL = 9 and full attendance PC = 10 give NF = 0.45 x 8 + 0.45 x 9 + 0.10 x 10 = 8.65.) The exam is written and consists of three problem-style subjects. Course attendance can be compensated by an additional course-related assignment, with the instructor's approval. The exam grade is computed from the activities completed and graded, and must be at least 45% (4.5). Bonuses of up to 10% are given for extra activities (participation in contests, student conferences, projects, etc.).
Bibliografie
1. Francisc Iacob, Sisteme Multimicroprocesor, Editura Victor, 2000, ISBN 973-8128-05-6.
2. Kai Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability, McGraw-Hill, 1993, ISBN 0-07-113342-9.
3. Hesham El-Rewini (Southern Methodist University), Mostafa Abd-El-Barr (Kuwait University), Advanced Computer Architecture and Parallel Processing, John Wiley & Sons, 2005, ISBN 0-471-46740-5.
4. Distributed and Parallel Systems: Cluster and Grid Computing, Springer, 2005, eBook ISBN 0-387-23096-3.
5. Joshy Joseph, Craig Fellenstein, Grid Computing, Prentice Hall, 2003, ISBN 0-13-145660-1.
Multiprocessor architectures
Interconnection buses: PCI, PCIe, InfiniBand
Multi-microprocessor chips: Intel Xeon, Cell BE
Cluster computing
Grid computing
Introduction
[Figure: Processor performance (KIPS, MIPS, GIPS) vs. calendar year, 1980-2010, growing at about 1.6x per year; data points include the 68000, 80286, 80386, 80486, 68040, Pentium, Pentium II, and R10000.]
Fig. 1.1 The exponential growth of microprocessor performance, known as Moore's Law, shown over the past two decades (extrapolated).
From: "Robots After All," by H. Moravec, CACM, pp. 90-97, October 2003.
Half-pitch (nm)   Clock freq. (GHz)   Wiring levels   Power supply (V)   Max. power (W)
     140                  2                 7               1.1               130
      90                  4                 8               1.0               160
      65                  7                 9               0.8               190
      45                 12                10               0.7               220
      32                 20                10               0.6               250
      22                 30                10               0.5               290
Factors contributing to the validity of Moore's law: denser circuits; architectural improvements.
Measures of processor performance:
- Instructions per second (MIPS, GIPS, TIPS, PIPS)
- Floating-point operations per second (MFLOPS, GFLOPS, TFLOPS, PFLOPS)
- Running time on benchmark suites
Higher speed (solve problems faster): important when there are hard or soft deadlines; e.g., a 24-hour weather forecast.
Higher throughput (solve more problems): important when there are many similar tasks to perform; e.g., transaction processing.
Higher computational power (solve larger problems): e.g., a weather forecast for a week rather than 24 hours, or with a finer mesh for greater accuracy.
Categories of supercomputers:
- Uniprocessor; aka vector machine
- Multiprocessor; centralized or distributed shared memory
- Multicomputer; communicating via message passing
- Massively parallel processor (MPP; 1K or more processors)
The speed of light is about 30 cm/ns. Signals travel at a fraction of the speed of light (say, 1/3). If signals must travel 1 cm during the execution of an instruction, that instruction will take at least 0.1 ns; thus, performance will be limited to 10 GIPS. This limitation is eased by continued miniaturization, architectural methods such as cache memory, etc.; however, a fundamental limit does exist. How does parallel processing help? Wouldn't multiple processors need to communicate via signals as well?
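The back-of-the-envelope arithmetic above can be written out directly; a minimal sketch, where the 1/3 signal-speed fraction and the 1 cm distance are the figures assumed in the text:

```python
# Back-of-the-envelope signal-propagation limit (values from the text above).
SPEED_OF_LIGHT_CM_PER_NS = 30.0      # ~30 cm/ns in vacuum
signal_fraction = 1.0 / 3.0          # signals travel at ~1/3 the speed of light
distance_cm = 1.0                    # distance a signal must cover per instruction

time_per_instruction_ns = distance_cm / (SPEED_OF_LIGHT_CM_PER_NS * signal_fraction)
instructions_per_second = 1.0 / (time_per_instruction_ns * 1e-9)

print(f"Minimum instruction time: {time_per_instruction_ns:.2f} ns")     # 0.10 ns
print(f"Performance ceiling: {instructions_per_second / 1e9:.0f} GIPS")  # ~10 GIPS
```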
Reasonable running time = fraction of an hour to several hours (10^3-10^4 s). In this time, a TIPS/TFLOPS machine can perform 10^15-10^16 operations.
Example 1: Southern oceans heat modeling (10-minute iterations): 300 GFLOP per iteration x 300,000 iterations per 6 yrs = 10^16 FLOP.
Example 2: Fluid dynamics calculations (1000 x 1000 x 1000 lattice): 10^9 lattice points x 1000 FLOP/point x 10,000 time steps = 10^16 FLOP.
Example 3: Monte Carlo simulation of a nuclear reactor: 10^11 particles to track (for 1000 escapes) x 10^4 FLOP/particle = 10^15 FLOP.
Decentralized supercomputing (from MathWorld News, 2006/4/7): a grid of tens of thousands of networked computers discovers 2^30,402,457 - 1, the 43rd Mersenne prime, as the largest known prime (9,152,052 digits).
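A quick sketch of how such totals translate into running time; the 1 TFLOPS sustained rate is an assumed machine class, and only Examples 2 and 3 are tallied:

```python
# The "reasonable running time" argument above in code form.
SUSTAINED_RATE = 1e12                      # FLOP/s (a TFLOPS-class machine, assumed)
for seconds in (1e3, 1e4):                 # fraction of an hour .. several hours
    print(f"{seconds:.0e} s at 1 TFLOPS -> {SUSTAINED_RATE * seconds:.0e} FLOP")

# Workloads from Examples 2 and 3:
fluid_dynamics = 1e9 * 1000 * 10_000       # lattice points * FLOP/point * time steps = 1e16
monte_carlo    = 1e11 * 1e4                # particles tracked * FLOP/particle        = 1e15
print(f"Fluid dynamics: {fluid_dynamics:.0e} FLOP "
      f"(~{fluid_dynamics / SUSTAINED_RATE / 3600:.1f} h at 1 TFLOPS)")
print(f"Reactor Monte Carlo: {monte_carlo:.0e} FLOP "
      f"(~{monte_carlo / SUSTAINED_RATE / 3600:.1f} h at 1 TFLOPS)")
```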
[Figure: ASCI program milestones, performance (TFLOPS) vs. calendar year starting in 1995: 3+ TFLOPS / 1.5 TB, 10+ TFLOPS / 5 TB, 30+ TFLOPS / 10 TB, 100+ TFLOPS / 20 TB.]
Fig. 24.1 Milestones in the Accelerated Strategic (Advanced Simulation &) Computing Initiative (ASCI) program, sponsored by the US Department of Energy, with extrapolation up to the PFLOPS level.
1. IBM Roadrunner (LANL, New Mexico)
Nuclear stockpile calculations, and more. 122,400 procs, 98 TB memory, 0.4 TB/s file-system I/O; Red Hat Linux; 1.38 PFLOPS, $130M. PowerXCell 8i 3.2 GHz + AMD Opteron (hybrid); 2.35 MW power, expands to 1M procs.

IBM Blue Gene/L (LLNL)*
Material science, nuclear stockpile sim. 32,768 procs, 8 TB memory, 28 TB disk storage; Linux + custom OS; 71 TFLOPS, $100M. Dual-proc PowerPC chips (10-15 W power); full system: 130k procs, 360 TFLOPS (est.).
* Actually 4th on the Top-500 list, with the 3rd being another IBM Blue Gene system at 0.557 PFLOPS.

2. SGI Columbia (NASA Ames, California)
Aerospace/space sim, climate research. 10,240 procs, 20 TB memory, 440 TB disk storage; Linux; 52 TFLOPS, $50M. 20 x Altix (512 Itanium 2 each) linked by InfiniBand.

NEC Earth Simulator**
Atmospheric, oceanic, and earth sciences. 5,120 procs, 10 TB memory, 700 TB disk storage; Unix; 36 TFLOPS, $400M? Built of custom vector microprocessors; volume = 50x IBM, power = 14x IBM.
** Led the Top-500 list for 2.5 years.
[Figure: Supercomputer performance (MFLOPS, GFLOPS, TFLOPS) vs. calendar year, 1980-2010; trend lines for vector supercomputers and microprocessors, with data points including the Cray X-MP, Y-MP, CM-2, CM-5, 80386, 80860, and Alpha.]
Fig. 1.2 The exponential growth in supercomputer performance over the past two decades (from [Bell92], with ASCI performance goals and microprocessor peak FLOPS superimposed as dotted lines).
Parallelism = concurrency: doing more than one thing at a time. It has been around for decades, since early computers: I/O channels, DMA, device controllers, multiple ALUs. The sense in which we use it in this course: multiple agents (hardware units, software processes) collaborate to perform our main computational task, e.g.:
- multiplying two matrices
- breaking a secret code
- deciding on the next chess move
Cost-Effectiveness Adage
Real news: the most cost-effective parallel solution may not be the one with the highest peak OPS (communication costs?), the greatest speedup (at what cost?), or the best utilization (hardware busy doing what?). Analogy: mass transit might be more cost-effective than private cars even if it is slower and leads to many empty seats.
Historical Perspective
Types of Parallelism
Note: many overlaps
- lookahead & pipelining
- vectorization
- concurrency & simultaneity
- data and control parallelism
- partitioning & specialization
- interleaving & overlapping of physical subsystems
- multiplicity & replication
- time & space sharing
- multitasking & multiprogramming
- multithreading
- distributed computing (for speed or availability)
[Figure: Architectural evolution: scalar sequential execution -> lookahead -> I/E overlap -> functional parallelism; explicit vector processing (memory-to-memory or register-to-register); SIMD machines (associative processors, processor arrays); MIMD machines (multiprocessors, multicomputers).]
(a) On-chip parallelism. (b) A coprocessor. (c) A multiprocessor. (d) A multicomputer. (e) A grid.
Instruction-Level Parallelism
(a) A CPU pipeline. (b) A sequence of VLIW instructions. (c) An instruction stream with bundles marked.
(a) An array of 8-bit elements. (b) The transposed array. (c) The original array fetched into four registers. (d) The transposed array in four registers.
(a) - (c) Three threads. The empty boxes indicate that the thread has stalled waiting for memory. (d) Fine-grained multithreading. (e) Coarse-grained multithreading.
Multithreading with a dual-issue superscalar CPU. (a) Fine-grained multithreading. (b) Coarse-grained multithreading. (c) Simultaneous multithreading.
Single-chip multiprocessors. (a) A dual-pipeline chip. (b) A chip with two cores.
The logical structure of a simple DVD player: a heterogeneous multiprocessor containing multiple cores for different functions.
Multiprocessors
(a) A multiprocessor with 16 CPUs sharing a common memory. (b) An image partitioned into 16 sections, each being analyzed by a different CPU.
Multicomputers (1)
Multicomputers (2)
Various layers where shared memory can be implemented. (a) The hardware. (b) The operating system. (c) The language runtime system.
Hybrid approach
In short, programming a multicomputer is much more difficult than programming a multiprocessor. Under these conditions, why would anyone build multicomputers, when multiprocessors are easier to program? The answer is simple: large multicomputers are much simpler and cheaper to build than multiprocessors with the same number of CPUs. Implementing a memory shared by even a few hundred CPUs is a substantial undertaking, whereas building a multicomputer with 10,000 CPUs or more is straightforward.
Thus we have a dilemma: multiprocessors are hard to build but easy to program whereas multicomputers are easy to build but hard to program. This observation has led to a great deal of effort to construct hybrid systems that are relatively easy to build and relatively easy to program. This work has led to the realization that shared memory can be implemented in various ways, each with its own set of advantages and disadvantages. In fact, much of the research in parallel architectures these days relates to the convergence of multiprocessor and multicomputer architectures into hybrid forms that combine the strengths of each. The holy grail here is to find designs that are scalable, that is, continue to perform well as more and more CPUs are added.
Modern computer systems are not monolithic but are constructed as a series of layers. This insight opens the possibility of implementing the shared memory at any one of several layers, as shown in Fig. 8-3. In Fig. 8-3(a) we see the shared memory being implemented by the hardware as a true multiprocessor. In this design, there is a single copy of the operating system with a single set of tables, in particular, the memory allocation table. When a process needs more memory, it traps to the operating system, which then looks in its table for a free page and maps the page into the caller's address space. As far as the operating system is concerned, there is a single memory, and it keeps track of which process owns which page in software. There are many ways to implement hardware shared memory, as we will see later.
The operating system can simulate shared memory by providing a single system-wide paged shared virtual address space. In this approach, called DSM (Distributed Shared Memory) (Li, 1988; Li and Hudak, 1986, 1989), each page is located in one of the memories of Fig. 8-2(a). Each machine has its own virtual memory and its own page tables. When a CPU does a LOAD or STORE on a page it does not have, a trap to the operating system occurs. The operating system then locates the page and asks the CPU currently holding it to unmap the page and send it over the interconnection network. When it arrives, the page is mapped in and the faulting instruction is restarted. In effect, the operating system is just satisfying page faults from remote memory instead of from disk. To the user, the machine looks as if it has shared memory.
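A minimal sketch of the DSM page-fault path just described; the Node/DSM classes, the page directory, and the single-owner migration policy are illustrative simplifications, not a real DSM implementation (which would also handle read replication, ownership changes, and invalidation):

```python
# Sketch: a page fault on a non-resident page makes the owner unmap the page
# and ship it over the (simulated) interconnection network.

class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.resident_pages = {}      # page number -> page contents

class DSM:
    def __init__(self, nodes, page_directory):
        self.nodes = nodes
        self.page_directory = page_directory   # page number -> owning node id

    def access(self, cpu, page_no):
        """LOAD/STORE on page_no from node cpu; fault if the page is not resident."""
        node = self.nodes[cpu]
        if page_no not in node.resident_pages:            # trap to the OS
            owner_id = self.page_directory[page_no]       # locate the page
            owner = self.nodes[owner_id]
            contents = owner.resident_pages.pop(page_no)  # owner unmaps it
            node.resident_pages[page_no] = contents       # map it in locally
            self.page_directory[page_no] = cpu            # record the new owner
        return node.resident_pages[page_no]               # restart the access

# Usage: page 10 starts on node 1; node 0 touches it and it migrates.
nodes = {0: Node(0), 1: Node(1)}
nodes[1].resident_pages[10] = b"page-10 data"
dsm = DSM(nodes, page_directory={10: 1})
print(dsm.access(0, 10))          # page 10 is shipped to node 0 and mapped in
```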
The language runtime system can implement a (possibly language-specific) form of shared memory. In this approach, the programming language provides some kind of shared memory abstraction, which is then implemented by the compiler and runtime system. For example, the Linda model is based on the abstraction of a shared space of tuples (data records containing a collection of fields). Processes on any machine can input a tuple from the shared tuple space or output a tuple to the shared tuple space. Because access to the tuple space is controlled entirely in software (by the Linda runtime system), no special hardware or operating system support is needed.
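A rough in-process sketch of Linda-style out()/in() on a tuple space; the TupleSpace class and its wildcard matching are illustrative only, not the actual Linda runtime (which distributes the space across machines):

```python
# Sketch: out() deposits a tuple; in() blocks until a tuple matching the
# template (None acts as a wildcard) is found, then removes and returns it.
import threading

class TupleSpace:
    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def out(self, tup):
        with self._cond:
            self._tuples.append(tup)
            self._cond.notify_all()

    def _match(self, template, tup):
        return len(template) == len(tup) and all(
            t is None or t == v for t, v in zip(template, tup))

    def in_(self, template):
        with self._cond:
            while True:
                for tup in self._tuples:
                    if self._match(template, tup):
                        self._tuples.remove(tup)
                        return tup
                self._cond.wait()

ts = TupleSpace()
ts.out(("task", 7, "invert-matrix"))
print(ts.in_(("task", None, None)))   # -> ('task', 7, 'invert-matrix')
```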
Topology
Various topologies. The heavy dots represent switches. The CPUs and memories are not shown. (a) A star. (b) A complete interconnect. (c) A tree. (d) A ring. (e) A grid. (f) A double torus. (g) A cube. (h) A 4D hypercube.
Topology - cont
When designing or analyzing an interconnection network, several areas stand out as being important. First, there is the matter of topology, that is, how the components are arranged. Second is how the switching works and how to handle contention for resources. Third is what routing algorithm is used to get messages to their destination efficiently.
Topology - cont
The topology of an interconnection network describes how the links and switches are arranged, for example, as a ring or as a grid. Topological designs can be modeled as graphs, with the links as arcs and the switches as nodes, as shown in Fig. 8-4. Each node in an interconnection network (or its graph) has some number of links connected to it. Mathematicians call the number of links the degree of the node; engineers call it the fanout. In general, the greater the fanout, the more routing choices there are and the greater the fault tolerance, that is, the ability to continue functioning even if a link fails by routing around it. If every node has k arcs and the wiring is done right, it is possible to design the network so that it remains fully connected even if k - 1 links fail.
Topology - cont
Another property of an interconnection network (or its graph) is its diameter. If we measure the distance between two nodes by the number of arcs that have to be traversed to get from one to the other, then the diameter of a graph is the distance between the two nodes that are the farthest apart (i.e., have the greatest distance between them). The diameter of an interconnection network is related to the worst-case delay when sending packets from CPU to CPU or from CPU to memory because each hop across a link takes a finite amount of time. The smaller the diameter is, the better the worst-case performance is. Also important is the average distance between two nodes, since this relates to the average packet transit time.
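A small sketch of how the diameter and average distance of a topology graph can be computed by breadth-first search; the 8-node ring used as the example is an assumption for illustration:

```python
# Diameter and average inter-node distance of a topology graph via BFS.
from collections import deque

def distances_from(graph, src):
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter_and_average(graph):
    all_pairs = [d for src in graph
                 for n, d in distances_from(graph, src).items() if n != src]
    return max(all_pairs), sum(all_pairs) / len(all_pairs)

# Example: an 8-node ring (cf. Fig. 8-4(d)); each switch links to its two neighbours.
ring = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
print(diameter_and_average(ring))    # (4, ~2.29): worst case 4 hops, average ~2.29
```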
Topology - cont
Yet another important property of an interconnection network is its transmission capacity, that is, how much data it can move per second. One useful measure of this capacity is the bisection bandwidth. To compute this quantity, we first have to (conceptually) partition the network into two equal (in terms of number of nodes) but unconnected parts by removing a set of arcs from its graph. Then we compute the total bandwidth of the arcs that have been removed. There may be many different ways to partition the network into two equal parts. The bisection bandwidth is the minimum over all the possible partitions. The significance of this number is that if the bisection bandwidth is, say, 800 bits/sec, then if there is a lot of communication between the two halves, the total throughput may be limited to only 800 bits/sec in the worst case. Many designers believe bisection bandwidth is the most important metric of an interconnection network. Many interconnection networks are designed with the goal of maximizing the bisection bandwidth.
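A brute-force sketch of the bisection-bandwidth definition, enumerating all equal-size partitions; the link bandwidths and the 4-node ring example are assumptions, and the approach is only feasible for small graphs:

```python
# Bisection bandwidth = minimum, over all splits into two equal halves, of the
# total bandwidth of the links that cross the cut.
from itertools import combinations

def bisection_bandwidth(nodes, links):
    """links: dict mapping frozenset({a, b}) -> link bandwidth."""
    nodes = list(nodes)
    best = float("inf")
    for half in combinations(nodes, len(nodes) // 2):
        half = set(half)
        cut = sum(bw for link, bw in links.items()
                  if len(link & half) == 1)          # link crosses the cut
        best = min(best, cut)
    return best

# Example: a 4-node ring with 800 bit/s links -> bisection bandwidth 1600 bit/s.
ring_links = {frozenset({i, (i + 1) % 4}): 800 for i in range(4)}
print(bisection_bandwidth(range(4), ring_links))     # 1600
```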
Topology - cont
Interconnection networks can be characterized by their dimensionality. For our purposes, the dimensionality is determined by the number of choices there are to get from the source to the destination. If there is never any choice (i.e., there is only one path from each source to each destination), the network is zero dimensional. If there is one dimension in which a choice can be made, for example, go east or go west, the network is one dimensional. If there are two axes, so a packet can go east or west or alternatively, go north or south, the network is two dimensional, and so on.
Topology - cont
Several topologies are shown in Fig. 8-4 . Only the links (lines) and switches (dots) are shown here. The memories and CPUs (not shown) would typically be attached to the switches by interfaces. In Fig. 8-4(a), we have a zero-dimensional star configuration, in which the CPUs and memories would be attached to the outer nodes, with the central one just doing switching.
Although a simple design, for a large system, the central switch is likely to be a major bottleneck. Also, from a fault-tolerance perspective, this is a poor design since a single failure at the central switch completely destroys the system.
Topology - cont
In Fig. 8-4(b), we have another zero-dimensional design that is at the other end of the spectrum, a full interconnect. Here every node has a direct connection to every other node.
This design maximizes the bisection bandwidth, minimizes the diameter, and is exceedingly fault tolerant (it can lose any six links and still be fully connected). Unfortunately, the number of links required for k nodes is k(k - 1)/2, which quickly gets out of hand for large k.
Topology - cont
Yet a third zero-dimensional topology is the tree, illustrated in Fig. 8-4(c). A problem with this design is that the bisection bandwidth is equal to the link capacity. Since there will normally be a lot of traffic near the top of the tree, the top few nodes will become bottlenecks. One way around this problem is to increase the bisection bandwidth by giving the upper links more bandwidth. For example, the lowest-level links might have a capacity b, the next level might have a capacity 2b, and the top-level links might each have 4b. Such a design is called a fat tree and has been used in commercial multicomputers, such as the (now-defunct) Thinking Machines CM-5.
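A tiny sketch of that capacity scaling, assuming a binary fat tree whose leaf links have capacity b; the function name and the printed invariant are illustrative only:

```python
# Aggregate bandwidth per level of a binary fat tree with n leaf links of
# capacity b: halving the number of links while doubling their capacity keeps
# the total bandwidth at each level constant, so the root is no longer the
# bottleneck it is in a plain tree.
def fat_tree_levels(num_leaf_links, b):
    levels = []
    links, capacity = num_leaf_links, b
    while links >= 1:
        levels.append((links, capacity, links * capacity))   # (links, per-link, aggregate)
        links //= 2
        capacity *= 2
    return levels

for links, cap, total in fat_tree_levels(8, 1):
    print(f"{links} links x capacity {cap}b = aggregate {total}b")
```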
Topology - cont
The ring of Fig. 8-4(d) is a one-dimensional topology by our definition because every packet sent has a choice of going left or going right. The grid or mesh of Fig. 8-4(e) is a two-dimensional design that has been used in many commercial systems. It is highly regular, easy to scale up to large sizes, and has a diameter that only increases as the square root of the number of nodes. A variant on the grid is the double torus of Fig. 8-4(f), which is a grid with the edges connected. Not only is it more fault tolerant than the grid, but the diameter is also less because the opposite corners can now communicate in only two hops.
Topology - cont
The cube of Fig. 8-4(g) is a regular three-dimensional topology. We have illustrated a 2 x 2 x 2 cube, but in the general case it could be a k x k x k cube. In Fig. 8-4(h) we have a four-dimensional cube constructed from two three-dimensional cubes with the corresponding edges connected. We could make a five-dimensional cube by cloning the structure of Fig. 8-4(h) and connecting the corresponding nodes to form a block of four cubes. To go to six dimensions, we could replicate the block of four cubes and interconnect the corresponding nodes, and so on.
Topology - cont
An n-dimensional cube formed this way is called a hypercube. Many parallel computers use this topology because the diameter grows linearly with the dimensionality. Put another way, the diameter is the base-2 logarithm of the number of nodes, so, for example, a 10-dimensional hypercube has 1024 nodes but a diameter of only 10, giving excellent delay properties. Note that in contrast, 1024 nodes arranged as a 32 x 32 grid have a diameter of 62, more than six times worse than the hypercube. The price paid for the smaller diameter is that the fanout, and thus the number of links (and the cost), is much larger for the hypercube. Nevertheless, the hypercube is a common choice for high-performance systems.
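The diameter comparison in the paragraph above, written out; the degree figures printed are the usual ones for these two topologies:

```python
# Diameter comparison for 1024 nodes: 10-D hypercube vs. 32 x 32 grid.
import math

n = 1024
hypercube_diameter = int(math.log2(n))           # one hop per differing address bit
side = int(math.isqrt(n))                        # 32 x 32 grid
grid_diameter = 2 * (side - 1)                   # corner to opposite corner
hypercube_degree = int(math.log2(n))             # fanout grows with dimensionality
grid_degree = 4                                  # at most 4 neighbours per node

print(f"10-D hypercube: diameter {hypercube_diameter}, degree {hypercube_degree}")
print(f"32x32 grid:     diameter {grid_diameter}, degree <= {grid_degree}")
```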
Flynn's taxonomy
[Fig. 1.11 The Flynn-Johnson classification of computer systems. Flynn's categories are SISD, SIMD, MISD (rarely used), and MIMD (multiprocessors or multicomputers). Johnson's expansion subdivides MIMD by memory organization (global vs. distributed) and communication mechanism (shared variables vs. message passing) into GMSV (shared-memory multiprocessors), GMMP (rarely used), DMSV, and DMMP.]
MIMD category
The MIMD category includes a wide class of computers. For this reason, in 1988, E. E. Johnson proposed a further classification of such machines based on
their memory structure (global or distributed) and the mechanism used for communication/synchronization (shared variables or message passing).
Again, one of the four categories (GMMP) is not widely used. The GMSV class is what is loosely referred to as (shared-memory) multiprocessors. At the other extreme, the DMMP class is known as (distributed-memory) multicomputers.
Finally, the DMSV class, which is becoming popular in view of combining the implementation ease of distributed memory with the programming ease of the shared-variable scheme, is sometimes called distributed shared memory. When all processors in a MIMD-type machine execute the same program, the result is sometimes referred to as single-program multiple-data (SPMD, pronounced "spim-dee"). Although Fig. 1.11 lumps all SIMD machines together, there are in fact variations similar to those suggested above for MIMD machines. At least conceptually, there can be shared-memory and distributed-memory SIMD machines in which the processors communicate by means of shared variables or explicit message passing.
Tanenbaum taxonomy
Three kinds of multiprocessors exist, distinguished by the way the shared memory is implemented on them. They are called
UMA (Uniform Memory Access), NUMA (NonUniform Memory Access), and COMA (Cache Only Memory Access).
These categories exist because in large multiprocessors, the memory is usually split up into multiple modules. UMA machines have the property that each CPU has the same access time to every memory module.
In other words, every memory word can be read as fast as every other memory word. If this is technically impossible, the fastest references are slowed down to match the slowest ones, so programmers do not see the difference. This is what uniform means here. This uniformity makes the performance predictable, an important factor for writing efficient code.
In contrast, in a NUMA multiprocessor, this property does not hold. Often there is a memory module close to each CPU, and accessing that memory module is much faster than accessing distant ones. The result is that for performance reasons, it matters where code and data are placed. COMA machines are also nonuniform, but in a different way. The other main category of MIMD machines consists of the multicomputers, which, unlike the multiprocessors, do not have shared primary memory at the architectural level. In other words, the operating system on a multicomputer CPU cannot access memory attached to a different CPU by just executing a LOAD instruction. It has to send an explicit message and wait for an answer. The ability of the operating system to read a distant word by just doing a LOAD is what distinguishes multiprocessors from multicomputers.
As we mentioned before, even on a multicomputer, user programs may have the ability to access remote memory by using LOAD and STORE instructions, but this illusion is supported by the operating system, not the hardware. This difference is subtle, but very important. Because multicomputers do not have direct access to remote memory, they are sometimes called NORMA (NO Remote Memory Access) machines. Multicomputers can be roughly divided into two categories. The first category contains the MPPs (Massively Parallel Processors), which are expensive supercomputers consisting of many CPUs tightly coupled by a high-speed proprietary interconnection network. The Cray T3E and IBM SP/2 are well-known examples.
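An illustrative sketch of what a "remote read" costs on a multicomputer: an explicit request/reply exchange rather than a single LOAD. The queue-based nodes and the message format are invented for the sketch; real systems use an interconnection network and a message-passing library:

```python
# Sketch: reading a word held by another node takes two messages, not one LOAD.
import queue, threading

class ComputeNode:
    def __init__(self):
        self.memory = {}                 # word address -> value (local memory only)
        self.inbox = queue.Queue()

    def remote_read(self, other, addr):
        """Read a word in another node's memory: send a request, wait for the reply."""
        reply_box = queue.Queue()
        other.inbox.put(("READ_REQ", addr, reply_box))
        return reply_box.get()           # block until the answer arrives

    def serve_forever(self):
        """Message handler: answer incoming read requests from local memory."""
        while True:
            tag, addr, reply_box = self.inbox.get()
            if tag == "READ_REQ":
                reply_box.put(self.memory.get(addr))

a, b = ComputeNode(), ComputeNode()
b.memory[0x40] = 123
threading.Thread(target=b.serve_forever, daemon=True).start()
print(a.remote_read(b, 0x40))            # 123 -- two messages instead of one LOAD
```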
The other category consists of regular PCs or workstations, possibly rack-mounted, and connected by commercial off-the-shelf interconnection technology.
Logically, there is not much difference, but huge supercomputers costing many millions of dollars are used differently than networks of PCs assembled by the users for a fraction of the price of an MPP.
These home-brew machines go by various names, including NOW (Network of Workstations) and COW (Cluster of Workstations).
Sequential Consistency
(a) Two CPUs writing and two CPUs reading a common memory word. (b) - (d) Three possible ways the two writes and four reads might be interleaved in time.
Weak Consistency
Weakly consistent memory uses synchronization operations to divide time into sequential epochs.
Three bus-based multiprocessors. (a) Without caching. (b) With caching. (c) With caching and private memories.
Snooping Caches
The write-through cache coherence protocol. The empty boxes indicate that no action is taken.
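A minimal sketch in the spirit of that protocol: writes go through to memory over the bus, and the other caches that snoop the write invalidate their copy of the word. The classes, the no-write-allocate choice, and the invalidate (rather than update) variant are simplifications:

```python
# Write-through snooping sketch: memory is always updated on a write; other
# caches discard (invalidate) any stale copy they are holding.
class Bus:
    def __init__(self, memory):
        self.memory = memory
        self.caches = []

    def write_through(self, origin, addr, value):
        self.memory[addr] = value                     # the write goes through to memory
        for cache in self.caches:                     # every other cache snoops the bus
            if cache is not origin:
                cache.snoop_write(addr)

class Cache:
    def __init__(self, bus):
        self.lines = {}                               # addr -> cached value
        self.bus = bus
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:                    # read miss: fetch from memory
            self.lines[addr] = self.bus.memory.get(addr, 0)
        return self.lines[addr]                       # read hit: use the local copy

    def write(self, addr, value):
        if addr in self.lines:
            self.lines[addr] = value                  # write hit: update the local copy
        self.bus.write_through(self, addr, value)     # hit or miss: update memory

    def snoop_write(self, addr):
        self.lines.pop(addr, None)                    # invalidate our stale copy, if any

memory = {}
bus = Bus(memory)
c1, c2 = Cache(bus), Cache(bus)
c1.write(0x10, 7)
print(c2.read(0x10))                                  # 7, fetched from memory
c1.write(0x10, 9)                                     # c2's copy is invalidated by snooping
print(c2.read(0x10))                                  # 9, re-fetched from memory
```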
UMA Multiprocessors Using Multistage Switching Networks (2)
An omega switching network.
NUMA Multiprocessors
A NUMA machine based on two levels of buses. The Cm* was the first multiprocessor to use this design.
(a) A 256-node directory-based multiprocessor. (b) Division of a 32-bit memory address into fields. (c) The directory at node 36.
The SunFire E25K uses a four-level interconnect. Dashed lines are address paths. Solid lines are data paths.
Message-Passing Multicomputers
A generic multicomputer.
BlueGene (1)
The BlueGene/L custom processor chip.
BlueGene (2)
The BlueGene/L. (a) Chip. (b) Card. (c) Board. (d) Cabinet. (e) System.
Google (1)
Google (2)
Scheduling
Scheduling a cluster. (a) FIFO. (b) Without head-of-line blocking. (c) Tiling. The shaded areas indicate idle CPUs.
A virtual address space consisting of 16 pages spread over four nodes of a multicomputer. (a) The initial situation.
A virtual address space consisting of 16 pages spread over four nodes of a multicomputer. (b) After CPU 0 references page 10.
A virtual address space consisting of 16 pages spread over four nodes of a multicomputer. (c) After CPU 1 references page 10, here assumed to be a read-only page.
Linda
Orca
A simplified ORCA stack object, with internal data and two operations.
Real programs achieve less than the perfect speedup indicated by the dotted line.
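The dotted line in this figure is perfect (linear) speedup. One common way to model why real programs fall short is an Amdahl-style serial fraction, sketched below; the 5% figure is an arbitrary assumption chosen for illustration, not taken from the figure:

```python
# Perfect (linear) speedup vs. an Amdahl-style model with serial fraction f:
# speedup(p) = 1 / (f + (1 - f) / p).
def amdahl_speedup(p, serial_fraction):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

f = 0.05   # assumed: 5% of the work cannot be parallelized
for p in (1, 2, 4, 8, 16, 64):
    print(f"{p:3d} CPUs: perfect {p:5.1f}x, with 5% serial code {amdahl_speedup(p, f):5.1f}x")
```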
(a) A 4-CPU bus-based system. (b) A 16-CPU bus-based system. (c) A 4-CPU grid-based system. (d) A 16-CPU grid-based system.
Grid Computing
The grid layers.