SMM Chapter 1 (Multi-Microprocessor Systems)
Duration: 14 weeks x 3 hours
Grading: NF = 45% NE + 45% NL + 10% PC, where NF = final grade, NE = exam grade, NL = laboratory grade, PC = course attendance. (For example, NE = 8, NL = 9 and full attendance PC = 10 give NF = 0.45 x 8 + 0.45 x 9 + 0.10 x 10 = 8.65.) The exam is written and consists of three problem-style subjects. Course attendance can be compensated by an additional course-related assignment, with the instructor's approval. The exam grade is computed from the activities completed and graded, and must be at least 45% (4.5). Bonuses of up to 10% are given for extra activities (participation in contests, student conferences, projects, etc.).
Bibliografie
1. Francisc Iacob, Sisteme Multimicroprocesor, Editura Victor, 2000, ISBN 973-8128-05-6.
2. Kai Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability, McGraw-Hill, 1993, ISBN 0-07-113342-9.
3. Hesham El-Rewini (Southern Methodist University), Mostafa Abd-El-Barr (Kuwait University), Advanced Computer Architecture and Parallel Processing, John Wiley & Sons, 2005, ISBN 0-471-46740-5.
4. Distributed and Parallel Systems: Cluster and Grid Computing, Springer, 2005, eBook ISBN 0-387-23096-3.
5. Joshy Joseph, Craig Fellenstein, Grid Computing, Prentice Hall, 2003, ISBN 0-13-145660-1.
Multiprocessor architectures
Interconnection buses: PCI, PCIe, InfiniBand
Multi-microprocessor chips: Intel Xeon, Cell BE
Cluster computing
Grid computing
Introduction
[Figure: Processor performance (KIPS, MIPS, GIPS) vs. calendar year, 1980-2010, growing at about 1.6x per year; data points include the 68000, 80286, 80386, 80486, 68040, Pentium, Pentium II, and R10000.]
Fig. 1.1 The exponential growth of microprocessor performance, known as Moore's Law, shown over the past two decades (extrapolated).
From: "Robots After All," by H. Moravec, CACM, pp. 90-97, October 2003.
Half-pitch (nm)   Clock freq. (GHz)   Wiring levels   Power supply (V)   Max. power (W)
     140                  2                 7               1.1               130
      90                  4                 8               1.0               160
      65                  7                 9               0.8               190
      45                 12                10               0.7               220
      32                 20                10               0.6               250
      22                 30                10               0.5               290
Factors contributing to the validity of Moore's law: denser circuits; architectural improvements.
Measures of processor performance:
- Instructions per second (MIPS, GIPS, TIPS, PIPS)
- Floating-point operations per second (MFLOPS, GFLOPS, TFLOPS, PFLOPS)
- Running time on benchmark suites
Higher speed (solve problems faster): important when there are hard or soft deadlines; e.g., a 24-hour weather forecast.
Higher throughput (solve more problems): important when there are many similar tasks to perform; e.g., transaction processing.
Higher computational power (solve larger problems): e.g., a weather forecast for a week rather than 24 hours, or with a finer mesh for greater accuracy.
Categories of supercomputers:
- Uniprocessor; aka vector machine
- Multiprocessor; centralized or distributed shared memory
- Multicomputer; communicating via message passing
- Massively parallel processor (MPP; 1K or more processors)
The speed of light is about 30 cm/ns. Signals travel at a fraction of the speed of light (say, 1/3). If signals must travel 1 cm during the execution of an instruction, that instruction will take at least 0.1 ns; thus, performance will be limited to 10 GIPS. This limitation is eased by continued miniaturization, architectural methods such as cache memory, etc.; however, a fundamental limit does exist. How does parallel processing help? Wouldn't multiple processors need to communicate via signals as well?
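The back-of-the-envelope arithmetic above can be written out directly; a minimal sketch, where the 1/3 signal-speed fraction and the 1 cm distance are the figures assumed in the text:

```python
# Back-of-the-envelope signal-propagation limit (values from the text above).
SPEED_OF_LIGHT_CM_PER_NS = 30.0      # ~30 cm/ns in vacuum
signal_fraction = 1.0 / 3.0          # signals travel at ~1/3 the speed of light
distance_cm = 1.0                    # distance a signal must cover per instruction

time_per_instruction_ns = distance_cm / (SPEED_OF_LIGHT_CM_PER_NS * signal_fraction)
instructions_per_second = 1.0 / (time_per_instruction_ns * 1e-9)

print(f"Minimum instruction time: {time_per_instruction_ns:.2f} ns")     # 0.10 ns
print(f"Performance ceiling: {instructions_per_second / 1e9:.0f} GIPS")  # ~10 GIPS
```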
Reasonable running time = fraction of an hour to several hours (10^3-10^4 s). In this time, a TIPS/TFLOPS machine can perform 10^15-10^16 operations.
Example 1: Southern oceans heat modeling (10-minute iterations): 300 GFLOP per iteration x 300,000 iterations per 6 yrs = 10^16 FLOP.
Example 2: Fluid dynamics calculations (1000 x 1000 x 1000 lattice): 10^9 lattice points x 1000 FLOP/point x 10,000 time steps = 10^16 FLOP.
Example 3: Monte Carlo simulation of a nuclear reactor: 10^11 particles to track (for 1000 escapes) x 10^4 FLOP/particle = 10^15 FLOP.
Decentralized supercomputing (from MathWorld News, 2006/4/7): a grid of tens of thousands of networked computers discovers 2^30,402,457 - 1, the 43rd Mersenne prime, as the largest known prime (9,152,052 digits).
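A quick sketch of how such totals translate into running time; the 1 TFLOPS sustained rate is an assumed machine class, and only Examples 2 and 3 are tallied:

```python
# The "reasonable running time" argument above in code form.
SUSTAINED_RATE = 1e12                      # FLOP/s (a TFLOPS-class machine, assumed)
for seconds in (1e3, 1e4):                 # fraction of an hour .. several hours
    print(f"{seconds:.0e} s at 1 TFLOPS -> {SUSTAINED_RATE * seconds:.0e} FLOP")

# Workloads from Examples 2 and 3:
fluid_dynamics = 1e9 * 1000 * 10_000       # lattice points * FLOP/point * time steps = 1e16
monte_carlo    = 1e11 * 1e4                # particles tracked * FLOP/particle        = 1e15
print(f"Fluid dynamics: {fluid_dynamics:.0e} FLOP "
      f"(~{fluid_dynamics / SUSTAINED_RATE / 3600:.1f} h at 1 TFLOPS)")
print(f"Reactor Monte Carlo: {monte_carlo:.0e} FLOP "
      f"(~{monte_carlo / SUSTAINED_RATE / 3600:.1f} h at 1 TFLOPS)")
```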
[Figure: ASCI program milestones, performance (TFLOPS) vs. calendar year starting in 1995: 3+ TFLOPS / 1.5 TB, 10+ TFLOPS / 5 TB, 30+ TFLOPS / 10 TB, 100+ TFLOPS / 20 TB.]
Fig. 24.1 Milestones in the Accelerated Strategic (Advanced Simulation &) Computing Initiative (ASCI) program, sponsored by the US Department of Energy, with extrapolation up to the PFLOPS level.
1. IBM Roadrunner (LANL, New Mexico)
Nuclear stockpile calculations, and more. 122,400 procs, 98 TB memory, 0.4 TB/s file-system I/O; Red Hat Linux; 1.38 PFLOPS, $130M. PowerXCell 8i 3.2 GHz + AMD Opteron (hybrid); 2.35 MW power, expands to 1M procs.

IBM Blue Gene/L (LLNL)*
Material science, nuclear stockpile sim. 32,768 procs, 8 TB memory, 28 TB disk storage; Linux + custom OS; 71 TFLOPS, $100M. Dual-proc PowerPC chips (10-15 W power); full system: 130k procs, 360 TFLOPS (est.).
* Actually 4th on the Top-500 list, with the 3rd being another IBM Blue Gene system at 0.557 PFLOPS.

2. SGI Columbia (NASA Ames, California)
Aerospace/space sim, climate research. 10,240 procs, 20 TB memory, 440 TB disk storage; Linux; 52 TFLOPS, $50M. 20 x Altix (512 Itanium 2 each) linked by InfiniBand.

NEC Earth Simulator**
Atmospheric, oceanic, and earth sciences. 5,120 procs, 10 TB memory, 700 TB disk storage; Unix; 36 TFLOPS, $400M? Built of custom vector microprocessors; volume = 50x IBM, power = 14x IBM.
** Led the Top-500 list for 2.5 years.
[Figure: Supercomputer performance (MFLOPS, GFLOPS, TFLOPS) vs. calendar year, 1980-2010; trend lines for vector supercomputers and microprocessors, with data points including the Cray X-MP, Y-MP, CM-2, CM-5, 80386, 80860, and Alpha.]
Fig. 1.2 The exponential growth in supercomputer performance over the past two decades (from [Bell92], with ASCI performance goals and microprocessor peak FLOPS superimposed as dotted lines).
Parallelism = concurrency: doing more than one thing at a time. It has been around for decades, since early computers: I/O channels, DMA, device controllers, multiple ALUs. The sense in which we use it in this course: multiple agents (hardware units, software processes) collaborate to perform our main computational task, e.g.:
- multiplying two matrices
- breaking a secret code
- deciding on the next chess move
Cost-Effectiveness Adage
Real news: the most cost-effective parallel solution may not be the one with the highest peak OPS (communication costs?), the greatest speedup (at what cost?), or the best utilization (hardware busy doing what?). Analogy: mass transit might be more cost-effective than private cars even if it is slower and leads to many empty seats.
Historical Perspective
Types of Parallelism
Note: many overlaps
- lookahead & pipelining
- vectorization
- concurrency & simultaneity
- data and control parallelism
- partitioning & specialization
- interleaving & overlapping of physical subsystems
- multiplicity & replication
- time & space sharing
- multitasking & multiprogramming
- multithreading
- distributed computing (for speed or availability)
[Figure: Architectural evolution: scalar sequential execution -> lookahead -> I/E overlap -> functional parallelism; explicit vector processing (memory-to-memory or register-to-register); SIMD machines (associative processors, processor arrays); MIMD machines (multiprocessors, multicomputers).]
(a) On-chip parallelism. (b) A coprocessor. (c) A multiprocessor. (d) A multicomputer. (e) A grid.
Instruction-Level Parallelism
(a) A CPU pipeline. (b) A sequence of VLIW instructions. (c) An instruction stream with bundles marked.
(a) An array of 8-bit elements. (b) The transposed array. (c) The original array fetched into four registers. (d) The transposed array in four registers.
(a) - (c) Three threads. The empty boxes indicate that the thread has stalled waiting for memory. (d) Fine-grained multithreading. (e) Coarse-grained multithreading.
Multithreading with a dual-issue superscalar CPU. (a) Fine-grained multithreading. (b) Coarse-grained multithreading. (c) Simultaneous multithreading.
Single-chip multiprocessors. (a) A dual-pipeline chip. (b) A chip with two cores.
The logical structure of a simple DVD player: a heterogeneous multiprocessor containing multiple cores for different functions.
Multiprocessors
(a) A multiprocessor with 16 CPUs sharing a common memory. (b) An image partitioned into 16 sections, each being analyzed by a different CPU.
Multicomputers (1)
Multicomputers (2)
Various layers where shared memory can be implemented. (a) The hardware. (b) The operating system. (c) The language runtime system.
Hybrid approach
In short, programming a multicomputer is much more difficult than programming a multiprocessor. Under these conditions, why would anyone build multicomputers, when multiprocessors are easier to program? The answer is simple: large multicomputers are much simpler and cheaper to build than multiprocessors with the same number of CPUs. Implementing a memory shared by even a few hundred CPUs is a substantial undertaking, whereas building a multicomputer with 10,000 CPUs or more is straightforward.
Thus we have a dilemma: multiprocessors are hard to build but easy to program whereas multicomputers are easy to build but hard to program. This observation has led to a great deal of effort to construct hybrid systems that are relatively easy to build and relatively easy to program. This work has led to the realization that shared memory can be implemented in various ways, each with its own set of advantages and disadvantages. In fact, much of the research in parallel architectures these days relates to the convergence of multiprocessor and multicomputer architectures into hybrid forms that combine the strengths of each. The holy grail here is to find designs that are scalable, that is, continue to perform well as more and more CPUs are added.
Modern computer systems are not monolithic but are constructed as a series of layers. This insight opens the possibility of implementing the shared memory at any one of several layers, as shown in Fig. 8-3. In Fig. 8-3(a) we see the shared memory being implemented by the hardware as a true multiprocessor. In this design, there is a single copy of the operating system with a single set of tables, in particular, the memory allocation table. When a process needs more memory, it traps to the operating system, which then looks in its table for a free page and maps the page into the caller's address space. As far as the operating system is concerned, there is a single memory, and it keeps track of which process owns which page in software. There are many ways to implement hardware shared memory, as we will see later.
The operating system can simulate shared memory by providing a single system-wide paged shared virtual address space. In this approach, called DSM (Distributed Shared Memory) (Li, 1988; Li and Hudak, 1986, 1989), each page is located in one of the memories of Fig. 8-2(a). Each machine has its own virtual memory and its own page tables. When a CPU does a LOAD or STORE on a page it does not have, a trap to the operating system occurs. The operating system then locates the page and asks the CPU currently holding it to unmap the page and send it over the interconnection network. When it arrives, the page is mapped in and the faulting instruction is restarted. In effect, the operating system is just satisfying page faults from remote memory instead of from disk. To the user, the machine looks as if it has shared memory.
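A minimal sketch of the DSM page-fault path just described; the Node/DSM classes, the page directory, and the single-owner migration policy are illustrative simplifications, not a real DSM implementation (which would also handle read replication, ownership changes, and invalidation):

```python
# Sketch: a page fault on a non-resident page makes the owner unmap the page
# and ship it over the (simulated) interconnection network.

class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.resident_pages = {}      # page number -> page contents

class DSM:
    def __init__(self, nodes, page_directory):
        self.nodes = nodes
        self.page_directory = page_directory   # page number -> owning node id

    def access(self, cpu, page_no):
        """LOAD/STORE on page_no from node cpu; fault if the page is not resident."""
        node = self.nodes[cpu]
        if page_no not in node.resident_pages:            # trap to the OS
            owner_id = self.page_directory[page_no]       # locate the page
            owner = self.nodes[owner_id]
            contents = owner.resident_pages.pop(page_no)  # owner unmaps it
            node.resident_pages[page_no] = contents       # map it in locally
            self.page_directory[page_no] = cpu            # record the new owner
        return node.resident_pages[page_no]               # restart the access

# Usage: page 10 starts on node 1; node 0 touches it and it migrates.
nodes = {0: Node(0), 1: Node(1)}
nodes[1].resident_pages[10] = b"page-10 data"
dsm = DSM(nodes, page_directory={10: 1})
print(dsm.access(0, 10))          # page 10 is shipped to node 0 and mapped in
```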
The language runtime system can implement a (possibly language-specific) form of shared memory. In this approach, the programming language provides some kind of shared memory abstraction, which is then implemented by the compiler and runtime system. For example, the Linda model is based on the abstraction of a shared space of tuples (data records containing a collection of fields). Processes on any machine can input a tuple from the shared tuple space or output a tuple to the shared tuple space. Because access to the tuple space is controlled entirely in software (by the Linda runtime system), no special hardware or operating system support is needed.
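A rough in-process sketch of Linda-style out()/in() on a tuple space; the TupleSpace class and its wildcard matching are illustrative only, not the actual Linda runtime (which distributes the space across machines):

```python
# Sketch: out() deposits a tuple; in() blocks until a tuple matching the
# template (None acts as a wildcard) is found, then removes and returns it.
import threading

class TupleSpace:
    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def out(self, tup):
        with self._cond:
            self._tuples.append(tup)
            self._cond.notify_all()

    def _match(self, template, tup):
        return len(template) == len(tup) and all(
            t is None or t == v for t, v in zip(template, tup))

    def in_(self, template):
        with self._cond:
            while True:
                for tup in self._tuples:
                    if self._match(template, tup):
                        self._tuples.remove(tup)
                        return tup
                self._cond.wait()

ts = TupleSpace()
ts.out(("task", 7, "invert-matrix"))
print(ts.in_(("task", None, None)))   # -> ('task', 7, 'invert-matrix')
```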
Topology
Various topologies. The heavy dots represent switches. The CPUs and memories are not shown. (a) A star. (b) A complete interconnect. (c) A tree. (d) A ring. (e) A grid. (f) A double torus. (g) A cube. (h) A 4D hypercube.
Topology - cont
When designing or analyzing an interconnection network, several areas stand out as being important. First, there is the matter of topology, that is, how the components are arranged. Second is how the switching works and how to handle contention for resources. Third is what routing algorithm is used to get messages to their destination efficiently.
Topology - cont
The topology of an interconnection network describes how the links and switches are arranged, for example, as a ring or as a grid. Topological designs can be modeled as graphs, with the links as arcs and the switches as nodes, as shown in Fig. 8-4. Each node in an interconnection network (or its graph) has some number of links connected to it. Mathematicians call the number of links the degree of the node; engineers call it the fanout. In general, the greater the fanout, the more routing choices there are and the greater the fault tolerance, that is, the ability to continue functioning even if a link fails by routing around it. If every node has k arcs and the wiring is done right, it is possible to design the network so that it remains fully connected even if k - 1 links fail.
Topology - cont
Another property of an interconnection network (or its graph) is its diameter. If we measure the distance between two nodes by the number of arcs that have to be traversed to get from one to the other, then the diameter of a graph is the distance between the two nodes that are the farthest apart (i.e., have the greatest distance between them). The diameter of an interconnection network is related to the worst-case delay when sending packets from CPU to CPU or from CPU to memory because each hop across a link takes a finite amount of time. The smaller the diameter is, the better the worst-case performance is. Also important is the average distance between two nodes, since this relates to the average packet transit time.
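A small sketch of how the diameter and average distance of a topology graph can be computed by breadth-first search; the 8-node ring used as the example is an assumption for illustration:

```python
# Diameter and average inter-node distance of a topology graph via BFS.
from collections import deque

def distances_from(graph, src):
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter_and_average(graph):
    all_pairs = [d for src in graph
                 for n, d in distances_from(graph, src).items() if n != src]
    return max(all_pairs), sum(all_pairs) / len(all_pairs)

# Example: an 8-node ring (cf. Fig. 8-4(d)); each switch links to its two neighbours.
ring = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
print(diameter_and_average(ring))    # (4, ~2.29): worst case 4 hops, average ~2.29
```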
Topology - cont
Yet another important property of an interconnection network is its transmission capacity, that is, how much data it can move per second. One useful measure of this capacity is the bisection bandwidth. To compute this quantity, we first have to (conceptually) partition the network into two equal (in terms of number of nodes) but unconnected parts by removing a set of arcs from its graph. Then we compute the total bandwidth of the arcs that have been removed. There may be many different ways to partition the network into two equal parts. The bisection bandwidth is the minimum over all the possible partitions. The significance of this number is that if the bisection bandwidth is, say, 800 bits/sec, then if there is a lot of communication between the two halves, the total throughput may be limited to only 800 bits/sec in the worst case. Many designers believe bisection bandwidth is the most important metric of an interconnection network. Many interconnection networks are designed with the goal of maximizing the bisection bandwidth.
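A brute-force sketch of the bisection-bandwidth definition, enumerating all equal-size partitions; the link bandwidths and the 4-node ring example are assumptions, and the approach is only feasible for small graphs:

```python
# Bisection bandwidth = minimum, over all splits into two equal halves, of the
# total bandwidth of the links that cross the cut.
from itertools import combinations

def bisection_bandwidth(nodes, links):
    """links: dict mapping frozenset({a, b}) -> link bandwidth."""
    nodes = list(nodes)
    best = float("inf")
    for half in combinations(nodes, len(nodes) // 2):
        half = set(half)
        cut = sum(bw for link, bw in links.items()
                  if len(link & half) == 1)          # link crosses the cut
        best = min(best, cut)
    return best

# Example: a 4-node ring with 800 bit/s links -> bisection bandwidth 1600 bit/s.
ring_links = {frozenset({i, (i + 1) % 4}): 800 for i in range(4)}
print(bisection_bandwidth(range(4), ring_links))     # 1600
```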
Topology - cont
Interconnection networks can be characterized by their dimensionality. For our purposes, the dimensionality is determined by the number of choices there are to get from the source to the destination. If there is never any choice (i.e., there is only one path from each source to each destination), the network is zero dimensional. If there is one dimension in which a choice can be made, for example, go east or go west, the network is one dimensional. If there are two axes, so a packet can go east or west or alternatively, go north or south, the network is two dimensional, and so on.
Topology - cont
Several topologies are shown in Fig. 8-4 . Only the links (lines) and switches (dots) are shown here. The memories and CPUs (not shown) would typically be attached to the switches by interfaces. In Fig. 8-4(a), we have a zero-dimensional star configuration, in which the CPUs and memories would be attached to the outer nodes, with the central one just doing switching.
Although a simple design, for a large system, the central switch is likely to be a major bottleneck. Also, from a fault-tolerance perspective, this is a poor design since a single failure at the central switch completely destroys the system.
Topology - cont
In Fig. 8-4(b), we have another zero-dimensional design that is at the other end of the spectrum, a full interconnect. Here every node has a direct connection to every other node.
This design maximizes the bisection bandwidth, minimizes the diameter, and is exceedingly fault tolerant (it can lose any six links and still be fully connected). Unfortunately, the number of links required for k nodes is k(k - 1)/2, which quickly gets out of hand for large k.
Topology - cont
Yet a third zero-dimensional topology is the tree, illustrated in Fig. 8-4(c). A problem with this design is that the bisection bandwidth is equal to the link capacity. Since there will normally be a lot of traffic near the top of the tree, the top few nodes will become bottlenecks. One way around this problem is to increase the bisection bandwidth by giving the upper links more bandwidth. For example, the lowest-level links might have a capacity b, the next level might have a capacity 2b, and the top-level links might each have 4b. Such a design is called a fat tree and has been used in commercial multicomputers, such as the (now-defunct) Thinking Machines CM-5.
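A tiny sketch of that capacity scaling, assuming a binary fat tree whose leaf links have capacity b; the function name and the printed invariant are illustrative only:

```python
# Aggregate bandwidth per level of a binary fat tree with n leaf links of
# capacity b: halving the number of links while doubling their capacity keeps
# the total bandwidth at each level constant, so the root is no longer the
# bottleneck it is in a plain tree.
def fat_tree_levels(num_leaf_links, b):
    levels = []
    links, capacity = num_leaf_links, b
    while links >= 1:
        levels.append((links, capacity, links * capacity))   # (links, per-link, aggregate)
        links //= 2
        capacity *= 2
    return levels

for links, cap, total in fat_tree_levels(8, 1):
    print(f"{links} links x capacity {cap}b = aggregate {total}b")
```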
Topology - cont
The ring of Fig. 8-4(d) is a one-dimensional topology by our definition because every packet sent has a choice of going left or going right. The grid or mesh of Fig. 8-4(e) is a two-dimensional design that has been used in many commercial systems. It is highly regular, easy to scale up to large sizes, and has a diameter that only increases as the square root of the number of nodes. A variant on the grid is the double torus of Fig. 8-4(f), which is a grid with the edges connected. Not only is it more fault tolerant than the grid, but the diameter is also less because the opposite corners can now communicate in only two hops.
Topology - cont
The cube of Fig. 8-4(g) is a regular three-dimensional topology. We have illustrated a 2 x 2 x 2 cube, but in the general case it could be a k x k x k cube. In Fig. 8-4(h) we have a four-dimensional cube constructed from two three-dimensional cubes with the corresponding edges connected. We could make a five-dimensional cube by cloning the structure of Fig. 8-4(h) and connecting the corresponding nodes to form a block of four cubes. To go to six dimensions, we could replicate the block of four cubes and interconnect the corresponding nodes, and so on.
Topology - cont
An n-dimensional cube formed this way is called a hypercube. Many parallel computers use this topology because the diameter grows linearly with the dimensionality. Put another way, the diameter is the base-2 logarithm of the number of nodes, so, for example, a 10-dimensional hypercube has 1024 nodes but a diameter of only 10, giving excellent delay properties. Note that in contrast, 1024 nodes arranged as a 32 x 32 grid have a diameter of 62, more than six times worse than the hypercube. The price paid for the smaller diameter is that the fanout, and thus the number of links (and the cost), is much larger for the hypercube. Nevertheless, the hypercube is a common choice for high-performance systems.
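The diameter comparison in the paragraph above, written out; the degree figures printed are the usual ones for these two topologies:

```python
# Diameter comparison for 1024 nodes: 10-D hypercube vs. 32 x 32 grid.
import math

n = 1024
hypercube_diameter = int(math.log2(n))           # one hop per differing address bit
side = int(math.isqrt(n))                        # 32 x 32 grid
grid_diameter = 2 * (side - 1)                   # corner to opposite corner
hypercube_degree = int(math.log2(n))             # fanout grows with dimensionality
grid_degree = 4                                  # at most 4 neighbours per node

print(f"10-D hypercube: diameter {hypercube_diameter}, degree {hypercube_degree}")
print(f"32x32 grid:     diameter {grid_diameter}, degree <= {grid_degree}")
```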
Flynn's taxonomy
[Fig. 1.11 The Flynn-Johnson classification of computer systems. Flynn's categories are SISD, SIMD, MISD (rarely used), and MIMD (multiprocessors or multicomputers). Johnson's expansion subdivides MIMD by memory organization (global vs. distributed) and communication mechanism (shared variables vs. message passing) into GMSV (shared-memory multiprocessors), GMMP (rarely used), DMSV, and DMMP.]
MIMD category
The MIMD category includes a wide class of computers. For this reason, in 1988, E. E. Johnson proposed a further classification of such machines based on
their memory structure (global or distributed) and the mechanism used for communication/synchronization (shared variables or message passing).
Again, one of the four categories (GMMP) is not widely used. The GMSV class is what is loosely referred to as (shared-memory) multiprocessors. At the other extreme, the DMMP class is known as (distributed-memory) multicomputers.
Finally, the DMSV class, which is becoming popular in view of combining the implementation ease of distributed memory with the programming ease of the shared-variable scheme, is sometimes called distributed shared memory. When all processors in a MIMD-type machine execute the same program, the result is sometimes referred to as single-program multiple-data (SPMD, pronounced "spim-dee"). Although Fig. 1.11 lumps all SIMD machines together, there are in fact variations similar to those suggested above for MIMD machines. At least conceptually, there can be shared-memory and distributed-memory SIMD machines in which the processors communicate by means of shared variables or explicit message passing.
Tanenbaum taxonomy
Three kinds of multiprocessors exist, distinguished by the way the shared memory is implemented on them. They are called
UMA (Uniform Memory Access), NUMA (NonUniform Memory Access), and COMA (Cache Only Memory Access).
These categories exist because in large multiprocessors, the memory is usually split up into multiple modules. UMA machines have the property that each CPU has the same access time to every memory module.
In other words, every memory word can be read as fast as every other memory word. If this is technically impossible, the fastest references are slowed down to match the slowest ones, so programmers do not see the difference. This is what uniform means here. This uniformity makes the performance predictable, an important factor for writing efficient code.
In contrast, in a NUMA multiprocessor, this property does not hold. Often there is a memory module close to each CPU, and accessing that memory module is much faster than accessing distant ones. The result is that for performance reasons, it matters where code and data are placed. COMA machines are also nonuniform, but in a different way. The other main category of MIMD machines consists of the multicomputers, which, unlike the multiprocessors, do not have shared primary memory at the architectural level. In other words, the operating system on a multicomputer CPU cannot access memory attached to a different CPU by just executing a LOAD instruction. It has to send an explicit message and wait for an answer. The ability of the operating system to read a distant word by just doing a LOAD is what distinguishes multiprocessors from multicomputers.
As we mentioned before, even on a multicomputer, user programs may have the ability to access remote memory by using LOAD and STORE instructions, but this illusion is supported by the operating system, not the hardware. This difference is subtle, but very important. Because multicomputers do not have direct access to remote memory, they are sometimes called NORMA (NO Remote Memory Access) machines. Multicomputers can be roughly divided into two categories. The first category contains the MPPs (Massively Parallel Processors), which are expensive supercomputers consisting of many CPUs tightly coupled by a high-speed proprietary interconnection network. The Cray T3E and IBM SP/2 are well-known examples.
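An illustrative sketch of what a "remote read" costs on a multicomputer: an explicit request/reply exchange rather than a single LOAD. The queue-based nodes and the message format are invented for the sketch; real systems use an interconnection network and a message-passing library:

```python
# Sketch: reading a word held by another node takes two messages, not one LOAD.
import queue, threading

class ComputeNode:
    def __init__(self):
        self.memory = {}                 # word address -> value (local memory only)
        self.inbox = queue.Queue()

    def remote_read(self, other, addr):
        """Read a word in another node's memory: send a request, wait for the reply."""
        reply_box = queue.Queue()
        other.inbox.put(("READ_REQ", addr, reply_box))
        return reply_box.get()           # block until the answer arrives

    def serve_forever(self):
        """Message handler: answer incoming read requests from local memory."""
        while True:
            tag, addr, reply_box = self.inbox.get()
            if tag == "READ_REQ":
                reply_box.put(self.memory.get(addr))

a, b = ComputeNode(), ComputeNode()
b.memory[0x40] = 123
threading.Thread(target=b.serve_forever, daemon=True).start()
print(a.remote_read(b, 0x40))            # 123 -- two messages instead of one LOAD
```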
The other category consists of regular PCs or workstations, possibly rack-mounted, and connected by commercial off-the-shelf interconnection technology.
Logically, there is not much difference, but huge supercomputers costing many millions of dollars are used differently than networks of PCs assembled by the users for a fraction of the price of an MPP.
These home-brew machines go by various names, including NOW (Network of Workstations) and COW (Cluster of Workstations).
Sequential Consistency
(a) Two CPUs writing and two CPUs reading a common memory word. (b) - (d) Three possible ways the two writes and four reads might be interleaved in time.
Weak Consistency
Weakly consistent memory uses synchronization operations to divide time into sequential epochs.
Three bus-based multiprocessors. (a) Without caching. (b) With caching. (c) With caching and private memories.
Snooping Caches
The write-through cache coherence protocol. The empty boxes indicate that no action is taken.
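A minimal sketch in the spirit of that protocol: writes go through to memory over the bus, and the other caches that snoop the write invalidate their copy of the word. The classes, the no-write-allocate choice, and the invalidate (rather than update) variant are simplifications:

```python
# Write-through snooping sketch: memory is always updated on a write; other
# caches discard (invalidate) any stale copy they are holding.
class Bus:
    def __init__(self, memory):
        self.memory = memory
        self.caches = []

    def write_through(self, origin, addr, value):
        self.memory[addr] = value                     # the write goes through to memory
        for cache in self.caches:                     # every other cache snoops the bus
            if cache is not origin:
                cache.snoop_write(addr)

class Cache:
    def __init__(self, bus):
        self.lines = {}                               # addr -> cached value
        self.bus = bus
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:                    # read miss: fetch from memory
            self.lines[addr] = self.bus.memory.get(addr, 0)
        return self.lines[addr]                       # read hit: use the local copy

    def write(self, addr, value):
        if addr in self.lines:
            self.lines[addr] = value                  # write hit: update the local copy
        self.bus.write_through(self, addr, value)     # hit or miss: update memory

    def snoop_write(self, addr):
        self.lines.pop(addr, None)                    # invalidate our stale copy, if any

memory = {}
bus = Bus(memory)
c1, c2 = Cache(bus), Cache(bus)
c1.write(0x10, 7)
print(c2.read(0x10))                                  # 7, fetched from memory
c1.write(0x10, 9)                                     # c2's copy is invalidated by snooping
print(c2.read(0x10))                                  # 9, re-fetched from memory
```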
UMA Multiprocessors Using Multistage Switching Networks (2)
An omega switching network.
NUMA Multiprocessors
A NUMA machine based on two levels of buses. The Cm* was the first multiprocessor to use this design.
(a) A 256-node directory-based multiprocessor. (b) Division of a 32-bit memory address into fields. (c) The directory at node 36.
The SunFire E25K uses a four-level interconnect. Dashed lines are address paths. Solid lines are data paths.
Message-Passing Multicomputers
A generic multicomputer.
BlueGene (1)
The BlueGene/L custom processor chip.
BlueGene (2)
The BlueGene/L. (a) Chip. (b) Card. (c) Board. (d) Cabinet. (e) System.
Google (1)
Google (2)
Scheduling
Scheduling a cluster. (a) FIFO. (b) Without head-of-line blocking. (c) Tiling. The shaded areas indicate idle CPUs.
A virtual address space consisting of 16 pages spread over four nodes of a multicomputer. (a) The initial situation.
A virtual address space consisting of 16 pages spread over four nodes of a multicomputer. (b) After CPU 0 references page 10.
A virtual address space consisting of 16 pages spread over four nodes of a multicomputer. (c) After CPU 1 references page 10, here assumed to be a read-only page.
Linda
Orca
A simplified ORCA stack object, with internal data and two operations.
Real programs achieve less than the perfect speedup indicated by the dotted line.
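The dotted line in this figure is perfect (linear) speedup. One common way to model why real programs fall short is an Amdahl-style serial fraction, sketched below; the 5% figure is an arbitrary assumption chosen for illustration, not taken from the figure:

```python
# Perfect (linear) speedup vs. an Amdahl-style model with serial fraction f:
# speedup(p) = 1 / (f + (1 - f) / p).
def amdahl_speedup(p, serial_fraction):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

f = 0.05   # assumed: 5% of the work cannot be parallelized
for p in (1, 2, 4, 8, 16, 64):
    print(f"{p:3d} CPUs: perfect {p:5.1f}x, with 5% serial code {amdahl_speedup(p, f):5.1f}x")
```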
(a) A 4-CPU bus-based system. (b) A 16-CPU bus-based system. (c) A 4-CPU grid-based system. (d) A 16-CPU grid-based system.
Grid Computing
The grid layers.