Data Parallel Architecture


SUBMITTED TO:                              SUBMITTED BY:
MR. SUMIT MITTU                            VIPAN KUMAR BAGGA
                                           REG. NO: 3010070218
                                           ROLL NO: 33
                                           BCA-A

ACKNOWLEDGEMENT

First & foremost I thanks my teacher who has assigned me this


term paper to bring out my creative capabilities.
I express my gratitude to my parents for being a continuous
source of encouragement & for all their financial aids given to
me.
I would like to acknowledge the assistance provided to me by the
library staff of LPU phagwara. My heartful gratitude to my
friends, roommates for helping me to complete my task in time.

VIPAN BAGGA

Table of Contents

1. WHAT IS PARALLEL COMPUTING?
2. WHY USE PARALLEL COMPUTING?
3. WHO AND WHAT?
4. INTRODUCTION
5. VON NEUMANN ARCHITECTURE
6. FLYNN'S CLASSICAL TAXONOMY
7. PARALLEL COMPUTER MEMORY ARCHITECTURES
8. PARALLEL PROGRAMMING MODELS
9. SUMMARY

What is Parallel Computing?


• Traditionally, software has been written for serial computation:
o To be run on a single computer having a single Central Processing
Unit (CPU).
o A problem is broken into a discrete series of instructions.
o Instructions are executed one after another.
o Only one instruction may execute at any moment in time.
• In the simplest sense, parallel computing is the simultaneous use of multiple
compute resources to solve a computational problem.
o To be run using multiple CPUs
o A problem is broken into discrete parts that can be solved
concurrently
o Each part is further broken down to a series of instructions
o Instructions from each part execute simultaneously on different CPUs (a
minimal code sketch appears at the end of this section)

• The computer resources can include:


o A single computer with multiple processors;
o An arbitrary number of computers connected by a network;
o A combination of both.
• The computational problem usually demonstrates characteristics such as the
ability to be:
o Broken apart into discrete pieces of work that can be solved
simultaneously;
o Executed as multiple program instructions at any moment in time;

o Solved in less time with multiple compute resources than with a single
compute resource.
• Parallel computing is an evolution of serial computing that attempts to
emulate what has always been the state of affairs in the natural world: many
complex, interrelated events happening at the same time, yet within a
sequence. Some examples:
o Planetary and galactic orbits
o Weather and ocean patterns
o Tectonic plate drift
o Rush hour traffic in LA
o Automobile assembly line
o Daily operations within a business
o Building a shopping mall
o Ordering a hamburger at the drive through.
• Traditionally, parallel computing has been considered to be "the high end of
computing" and has been motivated by numerical simulations of complex
systems and "Grand Challenge Problems" such as:
o weather and climate
o chemical and nuclear reactions
o biological, human genome
o geological, seismic activity
o mechanical devices - from prosthetics to spacecraft
o electronic circuits
o manufacturing processes
• Today, commercial applications are providing an equal or greater driving
force in the development of faster computers. These applications require the
processing of large amounts of data in sophisticated ways. Example
applications include:
o parallel databases, data mining
o oil exploration
o web search engines, web based business services
o computer-aided diagnosis in medicine
o management of national and multi-national corporations
o advanced graphics and virtual reality, particularly in the entertainment
industry
o networked video and multi-media technologies
o collaborative work environments
• Ultimately, parallel computing is an attempt to maximize the infinite but
seemingly scarce commodity called time.
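
To make the serial-versus-parallel decomposition described at the start of this
section concrete, here is a minimal Python sketch (the sum-of-squares problem,
the chunking scheme, and the worker count are invented for illustration). The
same result is computed once serially on a single CPU and once by breaking the
problem into discrete parts solved concurrently with Python's multiprocessing
module:

    from multiprocessing import Pool

    def sum_of_squares(chunk):
        # One "discrete part" of the problem: each worker handles its own chunk.
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        data = list(range(1_000_000))

        # Serial: one CPU, instructions executed one after another.
        serial_result = sum_of_squares(data)

        # Parallel: the problem is broken into 4 parts solved concurrently.
        chunks = [data[i::4] for i in range(4)]
        with Pool(processes=4) as pool:
            parallel_result = sum(pool.map(sum_of_squares, chunks))

        assert serial_result == parallel_result

The decomposition step (building the chunks) and the final combination step are
exactly the parts of a problem that must be added when moving from the serial
to the parallel version.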

Why Use Parallel Computing?


• The primary reasons for using parallel computing:
o Solve larger problems
o Provide concurrency (do multiple things at the same time)
• Other reasons might include:
o Taking advantage of non-local resources - using available compute
resources on a wide area network, or even the Internet when local
compute resources are scarce.
o Cost savings - using multiple "cheap" computing resources instead of
paying for time on a supercomputer.
o Overcoming memory constraints - single computers have very finite
memory resources. For large problems, using the memories of
multiple computers may overcome this obstacle.
• Limits to serial computing - both physical and practical reasons pose
significant constraints to simply building ever faster serial computers:
o Transmission speeds - the speed of a serial computer is directly
dependent upon how fast data can move through hardware. Absolute
limits are the speed of light (30 cm/nanosecond) and the transmission
limit of copper wire (9 cm/nanosecond). Increasing speeds necessitate
increasing proximity of processing elements (a short worked example
follows this list).
o Limits to miniaturization - processor technology is allowing an
increasing number of transistors to be placed on a chip. However,
even with molecular or atomic-level components, a limit will be
reached on how small components can be.
o Economic limitations - it is increasingly expensive to make a single
processor faster. Using a larger number of moderately fast commodity
processors to achieve the same (or better) performance is less
expensive.
• The future: during the past 10 years, the trends indicated by ever faster
networks, distributed systems, and multi-processor computer architectures
(even at the desktop level) clearly show that parallelism is the future of
computing.
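
As a short worked example of the transmission-speed limit above (the 3 GHz
clock rate is an assumption added for illustration, not a figure from the
text): at the stated signal speed of 30 cm/nanosecond, data can travel only
about

    \[ \frac{30\ \text{cm/ns}}{3\ \text{cycles/ns}} = 10\ \text{cm per clock cycle}, \]

so the processing elements of a very fast serial machine must be packed
physically close together.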

Who and What?

• Top500.org provides statistics on parallel computing users (the charts from
that site are not reproduced here). Some things to note:
o Sectors may overlap - for example, research may be classified research;
respondents have to choose between the two.
o "Not specified" is by far the largest application, which probably means
multiple applications.

I. INTRODUCTION

Processor performance has been increasing more rapidly than memory performance,
making memory bandwidth and latency the bottlenecks in many applications. The
performance of modern DRAMs is very sensitive to access patterns because of
their organization as multiple banks, where each bank is a two-dimensional
memory array. Memory systems consisting of several address-interleaved channels
of DRAMs are even more sensitive to access patterns. To achieve high
performance, an access pattern must have sufficient parallelism to tolerate
memory latency while keeping the bandwidth of all the banks and channels
occupied. The access pattern must also exhibit locality to minimize
activate/precharge cycles, avoid bank conflicts, and avoid read/write
turnaround penalties.

Multimedia and scientific applications have access patterns that contain
multiple streams or threads of accesses. For example, in a vector or stream
processor, each vector or stream load or store is a thread of related memory
references. Likewise, in a DSP with software-managed local memory, each DMA
transfer to or from local memory can be thought of as a thread of memory
references. A data parallel memory system can exploit parallelism both within a
thread, by generating several accesses per cycle from one thread, and across
threads, by generating accesses from different threads in parallel. Exploiting
parallelism across threads, however, may result in frequent read/write
turnaround and numerous precharge and activate cycles, because there is little
locality between two threads (a toy simulation of this effect appears at the
end of this section). Memory access scheduling [7] addresses this performance
degradation by reordering memory references to enhance locality.

In this paper we investigate an alternative and complementary method of
improving memory locality in data parallel systems: using only a single, wide
address generator (AG). With this approach, all accesses from one thread are
sent to the memory system before any accesses from another thread are
generated. This enhances locality, by eliminating interference between threads,
at the possible expense of load balance if the single thread does not access
the memory channels (MCs) and banks evenly. The single-thread approach can also
be combined with memory access scheduling to further enhance locality. Compared
to an approach using two narrower AGs, the single-AG configuration gives an
average 17% performance improvement in memory trace execution on a set of
multimedia and scientific applications. The rest of this paper is organized as
follows: Section II summarizes the characteristics of modern DRAM architecture,
Section III details the micro-architecture of the data parallel memory system,
Section IV describes the experimental setup, Section V analyzes the
microbenchmark and application memory trace results, and Section VI concludes
the paper.
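
As a toy illustration of the locality argument above (this sketch is not from
the paper being summarized: the single-bank model, the row size, and the two
thread access streams are simplified assumptions), the following Python snippet
counts row activations in one DRAM bank when two sequential access threads are
issued one thread at a time, as a single wide AG would, versus interleaved, as
two independent AGs might:

    # Toy model: one DRAM bank with a single open row.  Every access that
    # touches a different row than the currently open one costs an
    # activate/precharge; accesses to the open row are row-buffer hits.
    ROW_SIZE = 1024  # bytes per row (assumed value for illustration)

    def count_activates(addresses):
        open_row = None
        activates = 0
        for addr in addresses:
            row = addr // ROW_SIZE
            if row != open_row:
                activates += 1
                open_row = row
        return activates

    # Two "threads" of related references, e.g. two stream loads,
    # each walking a different region of memory sequentially.
    thread_a = [0x00000 + 4 * i for i in range(512)]
    thread_b = [0x80000 + 4 * i for i in range(512)]

    # Single wide address generator: all of thread A, then all of thread B.
    single_ag = thread_a + thread_b

    # Two address generators: accesses from the two threads interleaved.
    interleaved = [a for pair in zip(thread_a, thread_b) for a in pair]

    print("activates, one thread at a time:", count_activates(single_ag))
    print("activates, threads interleaved: ", count_activates(interleaved))

Issuing one thread at a time keeps the open row reusable across long runs of
related references, while interleaving the two threads forces an activation on
nearly every access; this interference is what the single-AG approach
eliminates.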

Von Neumann Architecture


• For over 40 years, virtually all computers have followed a common machine
model known as the von Neumann computer, named after the Hungarian
mathematician John von Neumann.
• A von Neumann computer uses the stored-program concept. The CPU executes a
stored program that specifies a sequence of read and write operations on
memory.

• Basic design:
o Memory is used to store both program instructions and data
o Program instructions are coded data which tell the computer to do
something
o Data is simply information to be used by the program
o A central processing unit (CPU) gets instructions and/or data from
memory, decodes the instructions and then sequentially performs
them.
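
As a minimal sketch of the basic design above (the instruction set and the
program are invented for illustration), a von Neumann machine can be modeled as
a single memory holding both instructions and data, with a CPU loop that
fetches, decodes, and sequentially performs one instruction at a time:

    # One memory holds both the program and the data it operates on.
    memory = {
        0: ("LOAD", 100),    # acc = memory[100]
        1: ("ADD", 101),     # acc += memory[101]
        2: ("STORE", 102),   # memory[102] = acc
        3: ("HALT", None),
        100: 2, 101: 3, 102: 0,
    }

    pc, acc = 0, 0                    # program counter and accumulator
    while True:
        op, operand = memory[pc]      # fetch
        pc += 1
        if op == "LOAD":              # decode + execute, one at a time
            acc = memory[operand]
        elif op == "ADD":
            acc += memory[operand]
        elif op == "STORE":
            memory[operand] = acc
        elif op == "HALT":
            break

    print(memory[102])  # 5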

Flynn's Classical Taxonomy

• There are different ways to classify parallel computers. One of the more
widely used classifications, in use since 1966, is called Flynn's Taxonomy.
• Flynn's taxonomy distinguishes multi-processor computer architectures
according to how they can be classified along the two independent
dimensions of Instruction and Data. Each of these dimensions can have
only one of two possible states: Single or Multiple.

• The matrix below defines the 4 possible classifications according to Flynn.

      SISD                                      SIMD
      Single Instruction, Single Data           Single Instruction, Multiple Data

      MISD                                      MIMD
      Multiple Instruction, Single Data         Multiple Instruction, Multiple Data
Single Instruction, Single Data (SISD):

• A serial (non-parallel) computer
• Single instruction: only one instruction stream is being acted on by the CPU
during any one clock cycle
• Single data: only one data stream is being used as input during any one
clock cycle
• Deterministic execution
• This is the oldest and, until recently, the most prevalent form of computer
• Examples: most PCs, single CPU workstations and mainframes
Single Instruction, Multiple Data (SIMD):

• A type of parallel computer
• Single instruction: All processing units execute the same instruction at any
given clock cycle
• Multiple data: Each processing unit can operate on a different data element
• This type of machine typically has an instruction dispatcher, a very
high-bandwidth internal network, and a very large array of very small-capacity
instruction units.
• Best suited for specialized problems characterized by a high degree of
regularity, such as image processing.
• Synchronous (lockstep) and deterministic execution
• Two varieties: Processor Arrays and Vector Pipelines
• Examples:
o Processor Arrays: Connection Machine CM-2, MasPar MP-1, MP-2
o Vector Pipelines: IBM 9000, Cray C90, Fujitsu VP, NEC SX-2, Hitachi S820
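
To illustrate the SIMD idea of one instruction operating on many data elements
at once, here is a small sketch using NumPy (an assumption: NumPy is not
mentioned in the text, but its vectorized array operations express a single
logical operation over whole arrays, and on most CPUs they are typically
executed with hardware SIMD vector instructions):

    import numpy as np

    a = np.arange(1_000_000, dtype=np.float32)
    b = np.arange(1_000_000, dtype=np.float32)

    # Scalar (SISD-style) view: one add per loop iteration.
    c_scalar = [a[i] + b[i] for i in range(10)]  # only a slice, for comparison

    # Data-parallel (SIMD-style) view: a single "add" expressed over
    # every element of the arrays at once.
    c_vector = a + b

    print(c_scalar[:3], c_vector[:3])

The vectorized form expresses the computation the way a processor array or
vector pipeline would execute it: the same operation applied in lockstep across
many data elements.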
Multiple Instruction, Single Data (MISD):
• A single data stream is fed into multiple processing units.
• Each processing unit operates on the data independently via independent instruction
streams.
• Few actual examples of this class of parallel computer have ever existed. One is the
experimental Carnegie-Mellon C.mmp computer (1971).
• Some conceivable uses might be:
o multiple frequency filters operating on a single signal stream
o multiple cryptography algorithms attempting to crack a single coded
message.

Multiple Instruction, Multiple Data (MIMD):

• Currently, the most common type of parallel computer. Most modern computers
fall into this category.
• Multiple Instruction: every processor may be executing a different
instruction stream
• Multiple Data: every processor may be working with a different data stream
• Execution can be synchronous or asynchronous, deterministic or
non-deterministic
• Examples: most current supercomputers, networked parallel computer "grids"
and multi-processor SMP computers - including some types of PCs.

Parallel Computer Memory Architectures

Shared Memory

General Characteristics: Shared memory parallel computers vary widely, but
generally have in common the ability for all processors to access all memory
as global address space.

• Multiple processors can operate independently but share the same memory
resources.
• Changes in a memory location effected by one processor are visible to all
other processors.
• Shared memory machines can be divided into two main classes based upon
memory access times: UMA and NUMA.

Uniform Memory Access (UMA):



• Most commonly represented today by Symmetric Multiprocessor (SMP) machines
• Identical processors
• Equal access and access times to memory
• Sometimes called CC-UMA - Cache Coherent UMA. Cache coherent means
if one processor updates a location in shared memory, all the other
processors know about the update. Cache coherency is accomplished at the
hardware level.

Non-Uniform Memory Access (NUMA):

• Often made by physically linking two or more SMPs


• One SMP can directly access memory of another SMP
• Not all processors have equal access time to all memories
• Memory access across link is slower
• If cache coherency is maintained, then may also be called CC-NUMA -
Cache Coherent NUMA

Advantages:

• Global address space provides a user-friendly programming perspective to
memory
• Data sharing between tasks is both fast and uniform due to the proximity of
memory to CPUs

Disadvantages:

• Primary disadvantage is the lack of scalability between memory and CPUs.
Adding more CPUs can geometrically increase traffic on the shared memory-CPU
path and, for cache coherent systems, geometrically increase traffic
associated with cache/memory management.
• Programmer responsibility for synchronization constructs that ensure
"correct" access of global memory (illustrated in the sketch after this list).
• Expense: it becomes increasingly difficult and expensive to design and
produce shared memory machines with ever increasing numbers of
processors.
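
As a minimal sketch of the synchronization responsibility noted above (the
shared counter, thread count, and iteration count are invented for
illustration), threads in a shared-memory program see the same data, so
concurrent updates must be protected, here with a lock from Python's
threading module:

    import threading

    counter = 0                  # shared state: visible to every thread
    lock = threading.Lock()

    def add_many(n):
        global counter
        for _ in range(n):
            with lock:           # without the lock, counter += 1 can race
                counter += 1

    threads = [threading.Thread(target=add_many, args=(100_000,))
               for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    print(counter)  # 400000, because every update was synchronized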

Parallel Computer Memory Architectures

Distributed Memory

General Characteristics:

• Like shared memory systems, distributed memory systems vary widely but
share a common characteristic. Distributed memory systems require a
communication network to connect inter-processor memory.

• Processors have their own local memory. Memory addresses in one processor do
not map to another processor, so there is no concept of global address space
across all processors.
• Because each processor has its own local memory, it operates independently.
Changes it makes to its local memory have no effect on the memory of other
processors. Hence, the concept of cache coherency does not apply.
• When a processor needs access to data in another processor, it is usually the
task of the programmer to explicitly define how and when data is
communicated. Synchronization between tasks is likewise the programmer's
responsibility.
• The network "fabric" used for data transfer varies widely, though it can can
be as simple as Ethernet.

Advantages:

• Memory is scalable with number of processors. Increase the number of
processors and the size of memory increases proportionately.
• Each processor can rapidly access its own memory without interference and
without the overhead incurred with trying to maintain cache coherency.

• Cost effectiveness: can use commodity, off-the-shelf processors and
networking.

Disadvantages:

• The programmer is responsible for many of the details associated with data
communication between processors (a minimal message-passing sketch follows
this list).
• It may be difficult to map existing data structures, based on global memory,
to this memory organization.
• Non-uniform memory access (NUMA) times
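
As a minimal sketch of explicit data communication between processors with
separate local memories (the worker function and the use of Python's
multiprocessing queues are illustrative assumptions; real distributed-memory
programs typically use a message-passing library such as MPI over the network
fabric):

    from multiprocessing import Process, Queue

    def worker(rank, inbox, outbox):
        # Each process has its own private memory; the only way to share
        # data is to send an explicit message.
        data = inbox.get()                 # receive work from the "master"
        outbox.put((rank, sum(data)))      # send the partial result back

    if __name__ == "__main__":
        inboxes = [Queue() for _ in range(2)]
        results = Queue()
        procs = [Process(target=worker, args=(r, inboxes[r], results))
                 for r in range(2)]
        for p in procs:
            p.start()

        inboxes[0].put(list(range(0, 50)))    # explicitly ship data to process 0
        inboxes[1].put(list(range(50, 100)))  # and to process 1

        total = sum(results.get()[1] for _ in procs)
        for p in procs:
            p.join()
        print(total)  # 4950

Nothing is shared implicitly: both the work and the results cross process
boundaries only because the programmer explicitly sends and receives them.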

Parallel Programming Models

Other Models

• Many parallel programming models exist, and they will continue to evolve
along with the ever-changing world of computer hardware and software. Two of
the more common ones are described here.

Single Program Multiple Data (SPMD):

• SPMD is actually a "high level" programming model that can be built upon
any combination of the previously mentioned parallel programming models.
• A single program is executed by all tasks simultaneously.
• At any moment in time, tasks can be executing the same or different
instructions within the same program.
• SPMD programs usually have the necessary logic programmed into them to
allow different tasks to branch or conditionally execute only those parts of
the program they are designed to execute. That is, tasks do not necessarily
have to execute the entire program - perhaps only a portion of it.
• All tasks may use different data
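
A minimal SPMD sketch follows (illustrative assumptions: Python's
multiprocessing is used to launch the tasks, and the rank/slice scheme is
invented; production SPMD codes commonly use MPI instead). Every task runs the
same program but branches on its own rank, so each task executes only the part
of the work it is designed to handle:

    from multiprocessing import Process

    NTASKS = 4
    DATA = list(range(100))

    def program(rank):
        # Same program text for every task; behavior differs only by rank.
        chunk = DATA[rank::NTASKS]          # each task takes its own slice
        partial = sum(chunk)
        if rank == 0:
            print("task 0 also handles I/O, partial sum:", partial)
        else:
            print(f"task {rank} partial sum: {partial}")

    if __name__ == "__main__":
        tasks = [Process(target=program, args=(r,)) for r in range(NTASKS)]
        for t in tasks:
            t.start()
        for t in tasks:
            t.join()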

Multiple Program Multiple Data (MPMD):

• Like SPMD, MPMD is actually a "high level" programming model that can
be built upon any combination of the previously mentioned parallel
programming models.

• MPMD applications typically have multiple executable object files
(programs). While the application is being run in parallel, each task can be
executing the same or different program as other tasks.
• All tasks may use different data
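
A minimal MPMD sketch follows (again using Python's multiprocessing purely for
illustration): unlike SPMD, the two tasks here execute different programs, a
producer and a consumer, as parts of one parallel application:

    from multiprocessing import Process, Queue

    def producer(q):
        # One program: generates work.
        for x in range(5):
            q.put(x * x)
        q.put(None)  # sentinel: no more work

    def consumer(q):
        # A different program: processes the work.
        while (item := q.get()) is not None:
            print("consumed", item)

    if __name__ == "__main__":
        q = Queue()
        tasks = [Process(target=producer, args=(q,)),
                 Process(target=consumer, args=(q,))]
        for t in tasks:
            t.start()
        for t in tasks:
            t.join()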

SUMMARY
This term paper has covered the basics of parallel computing. Beginning with a
brief overview of what parallel computing is and why it is used, it introduced
the associated concepts and terminology, the von Neumann architecture, and
Flynn's classical taxonomy, and then explored parallel computer memory
architectures (shared and distributed memory), the data parallel memory
system, and parallel programming models such as SPMD and MPMD.

