Parallel & Distributed Computing Report


Parallel & Distributed Computing

Report:

CUDA (Compute Unified Device Architecture)

Group members:

Muhammad Anas 1812102
Alishan Nadeem 1812120
Asim Ibrahim 1812104
Hafiz Umer Ali 1812112

Introduction:

CUDA is a general-purpose computing framework designed by NVIDIA to run on graphics
hardware. The G80, G92, and later chips based on the unified architecture are able to use
CUDA; this category includes several GeForce and Quadro models.

Tesla devices were designed with general-purpose computing in mind. This new technology has
greatly simplified the creation of general-purpose applications on GPUs. Because of the low
cost of this technology, graphics accelerators are now common in desktop PCs, and we may
expect widespread adoption of the CUDA architecture among developers.

CUDA ARCHITECTURE:

CUDA can be easily incorporated into existing IT infrastructures. A software development kit
and a C compiler are included with CUDA, so development takes place in a familiar C
environment. Under Linux and Windows operating systems, this solution is compatible with
AMD/Intel x86 and x64 microprocessors. The graphics accelerator is connected to a host
machine (HOST) over PCI Express x16. This interconnection has a total bandwidth of 8 GB/s
(4 GB/s upstream and 4 GB/s downstream).
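
Every HOST-to-DEVICE transfer has to cross this PCI Express link, so its bandwidth can be
measured directly. The following is a minimal sketch (the buffer size and variable names are
illustrative, not taken from the report) that times one upstream copy with CUDA events:

    // Hypothetical sketch: timing a host-to-device copy with CUDA events.
    // The 256 MB buffer size and all names are illustrative only.
    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 256 * 1024 * 1024;        // 256 MB test buffer
        float *h_buf = (float*)malloc(bytes);          // HOST buffer (contents irrelevant here)
        float *d_buf;
        cudaMalloc(&d_buf, bytes);                     // DEVICE buffer

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);  // upstream copy over PCIe
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);        // elapsed time in milliseconds
        printf("Host-to-device bandwidth: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

        cudaFree(d_buf);
        free(h_buf);
        return 0;
    }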

CUDA broadens the scope of general-purpose computing on the GPU. In this setting the GPU acts
as a parallel coprocessor that processes massive amounts of data simultaneously, while the CPU
manages, presents, and arranges the computing activity. Complex datasets are broken down into
smaller, independent pieces that are processed with the same instructions; in other words, the
data can be processed in parallel. CUDA is made up of a few basic components that are
responsible for this simultaneous data processing: the CUDA driver, the CUDA API, and the
CUDA mathematics libraries (CUBLAS and CUFFT). The C compiler greatly simplifies the building
of parallel applications, so a developer can devote his or her entire attention to the
application itself.

The developer does not need to reformulate a problem in terms of the graphics API in order to
implement it. CPU and GPU code can be written as a single unit: extension directives tell the
compiler which parts should be handled by the GPU and which by the CPU. The code is first
compiled with the CUDA GPU compiler and then with the regular C compiler. This programming
paradigm lets the same code run on many graphics accelerator versions; the number of stream
processors in use is irrelevant to the programmer. The program is started on the host machine,
the CUDA driver runs the GPU code automatically, and the CPU part of the software communicates
with the GPU part through a high-speed interface.
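
A minimal sketch of this single-source style (the kernel name and sizes are illustrative, not
taken from the report): the __global__ qualifier marks the part that the GPU executes, while
the rest is ordinary host code, and the CUDA compiler separates the two at build time:

    // Minimal single-source sketch: __global__ marks GPU code, the rest runs on the CPU.
    // Array size and names are illustrative.
    #include <cuda_runtime.h>

    __global__ void scale(float *data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
        if (i < n) data[i] *= factor;
    }

    int main() {
        const int n = 1 << 20;
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));          // data would be copied in here

        // Host code configures the launch; the CUDA driver runs the kernel on the GPU.
        scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
        cudaDeviceSynchronize();

        cudaFree(d_data);
        return 0;
    }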

Special operations in the driver enable communication between the CPU and the GPU, so the
programmer is not in charge of managing computing resources. Applications that require
parallel computational capability can benefit from CUDA paired with a massively parallel GPU.

COMPOSITION OF PARALLEL COMPUTATION WITH CUDA:


A thread is the fundamental unit of parallel data processing. Threads are used to partition
the overall dataset: the threads are grouped, and a parallel program (the kernel) processes
these groups. With thousands of threads in flight, we can achieve an efficient acceleration.

A block is a collection of threads that can effectively collaborate and share data via the
shared cache memory. A single kernel handles all threads in the same block. Each thread in the
block has a unique identifier that determines the thread's placement within the block. If the
block is regarded as a 2D array of size (Dx, Dy), the thread ID for position (x, y) is
(x + y·Dx); for a 3D block of size (Dx, Dy, Dz), the thread ID for position (x, y, z) is
(x + y·Dx + z·Dx·Dy).
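
A short, illustrative sketch of this indexing (the helper names are hypothetical): inside a
kernel, threadIdx gives a thread's position in its block and blockDim gives the block size
(Dx, Dy, Dz), from which the linear thread ID above can be computed:

    // Computing a thread's linear ID inside its block (2D and 3D cases).
    __device__ int threadId2D() {
        // For a block of size (Dx, Dy): ID = x + y*Dx
        return threadIdx.x + threadIdx.y * blockDim.x;
    }

    __device__ int threadId3D() {
        // For a block of size (Dx, Dy, Dz): ID = x + y*Dx + z*Dx*Dy
        return threadIdx.x
             + threadIdx.y * blockDim.x
             + threadIdx.z * blockDim.x * blockDim.y;
    }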

MEMORY MODEL OF CUDA ARCHITECTURE:


The CUDA architecture uses a shared memory model. Memory transfers between the HOST and the
DEVICE are made possible by extension instructions, while local shared memory and registers
keep memory operations on the chip. The graphics accelerator's memory is split into several
sections.

Local thread operations are performed using local memory and registers. The shared memory is
used for thread-to-thread communication within a block, and the global memory handles
communication between blocks and grids.

The HOST memory is not accessible to a thread. If we need to access that part of memory, the
data is copied to the DEVICE's global memory; a transfer in the reverse direction is also
possible. CUDA provides a shared memory model in which read and write memory operations work
the same way on the HOST and on the DEVICE. The shared memory approach allows faster memory
operations and more efficient thread communication, so the application becomes less reliant on
DRAM bandwidth.
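
A minimal sketch of these transfers (buffer sizes and names are illustrative): data is
allocated in the DEVICE's global memory, copied there from HOST memory, and copied back in the
reverse direction when needed:

    // Sketch: explicit copies between HOST memory and DEVICE global memory.
    // Sizes and names are illustrative.
    #include <cstdlib>
    #include <cuda_runtime.h>

    int main() {
        const int n = 1024;
        size_t bytes = n * sizeof(float);

        float *h_in = (float*)malloc(bytes);      // HOST memory (not visible to GPU threads)
        float *d_in;
        cudaMalloc(&d_in, bytes);                 // DEVICE global memory

        cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);   // HOST -> DEVICE
        // ... kernel launches would operate on d_in here ...
        cudaMemcpy(h_in, d_in, bytes, cudaMemcpyDeviceToHost);   // DEVICE -> HOST (reverse)

        cudaFree(d_in);
        free(h_in);
        return 0;
    }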

HARDWARE IMPLEMENTATION OF CUDA ARCHITECTURE:


The DEVICE is a collection of multiprocessors, and a SIMD (Single Instruction, Multiple Data)
architecture is used in each multiprocessor.

Every clock cycle, a multiprocessor executes the same instruction on different data. A few
different types of memory locations are available to each processor:

● Each processor has its own 32-bit registers.
● All processors share a common parallel data cache, which makes use of a shared memory space.
● All processors share a constant cache, which makes use of a read-only memory space.
● All processors share a texture cache, which also makes use of a read-only memory space.
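
A brief sketch of how these spaces appear inside a kernel (names and sizes are illustrative,
and the texture path is omitted): local variables live in per-thread registers, __shared__
data sits in the on-chip parallel cache of the block, and __constant__ data is read-only for
all processors:

    // Sketch of the per-multiprocessor memory spaces as seen from a kernel.
    // Assumes a launch with 256 threads per block; names are illustrative.
    __constant__ float coeff[16];      // constant memory: read-only for all processors
                                       // (filled from the HOST with cudaMemcpyToSymbol)

    __global__ void memorySpaces(const float *in, float *out, int n) {
        __shared__ float tile[256];    // shared on-chip memory: visible to the whole block
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // index held in a per-thread register

        if (i < n) tile[threadIdx.x] = in[i];   // stage data in the shared cache
        __syncthreads();                        // every thread in the block reaches this point
        if (i < n) out[i] = tile[threadIdx.x] * coeff[0];
    }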

EXECUTION MODEL OF CUDA ARCHITECTURE:


The DEVICE is in charge of processing the grid. Each multiprocessor works through batches of
blocks one by one, and each block is processed by only one multiprocessor. Memory operations
are kept local, which speeds up the entire process. The number of blocks that a multiprocessor
can process is determined by how many registers and how much shared memory they require; a
kernel will not start if there is insufficient memory. Active blocks are those that are
handled in a single batch by a single multiprocessor. Each active block is split into SIMD
groups of threads called warps, and every warp contains the same number of threads. The order
in which warps execute is not fixed, although they can be synchronized. The order of the
blocks in a grid is likewise undefined, and there is no synchronization between blocks:
threads from separate blocks of the same grid cannot reliably communicate with one another.
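
A sketch of what this constraint means in practice (names are illustrative): threads of one
block can cooperate through shared memory and barrier synchronization, but the partial results
of different blocks have to be combined in a separate step, because blocks cannot synchronize
with each other:

    // Per-block partial sum: cooperation is limited to threads of the same block.
    // Assumes a launch with 256 threads per block (a power of two).
    __global__ void blockSum(const float *in, float *blockResults, int n) {
        __shared__ float buf[256];
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;

        buf[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                                 // valid: same-block synchronization

        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride) buf[tid] += buf[tid + stride];
            __syncthreads();
        }

        if (tid == 0) blockResults[blockIdx.x] = buf[0]; // one partial result per block;
                                                         // combining them needs another kernel
                                                         // launch or the CPU
    }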

GRAPHICS ACCELERATOR INTEGRATION TO THE PARALLEL ENVIRONMENT:
Massive parallel processing capacity is not the only benefit of graphics accelerators. They
can also be integrated into a parallel environment such as a computer cluster, and adding
these accelerators to the cluster introduces another level of parallelism. The CUDA
programming model makes managing and synchronizing this parallel processing easier. Each
cluster node would function as the HOST for its graphics accelerator.

The total computation can be split into multiple parallel stages. We may encounter a number of
issues, including unifying the platform across all integrated hardware and developing
appropriate algorithms for computation on this hardware platform. The latency of memory
operations between GPU and CPU is the main limitation of this acceleration.
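
One possible way to realize such a split, sketched here under the assumption that MPI handles
the cluster level (the report does not prescribe any particular middleware, and all names are
illustrative), is to let each node's process act as the HOST for its accelerator and to
exchange only the partial results over the network:

    // Hypothetical two-level parallelism sketch: MPI across the cluster nodes,
    // CUDA inside each node. Names and the work-splitting scheme are illustrative.
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this node's position in the cluster
        MPI_Comm_size(MPI_COMM_WORLD, &size);   // number of participating nodes

        cudaSetDevice(0);                       // each node acts as HOST for its accelerator

        float localResult = 0.0f;
        // ... copy this node's share of the data to the GPU, launch kernels,
        //     and copy the partial result back into localResult ...

        float totalResult = 0.0f;
        MPI_Reduce(&localResult, &totalResult, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }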

When searching for appropriate algorithms, we must examine the time spent on computation and
on communication, and then decide which parts should run on the GPU rather than the CPU. In
this setting, a multi-paradigm approach is worth examining. A multi-paradigm language lets a
user draw on more programming methods: it simplifies the syntax of the language and expands
its application areas through extended semantics. As a result, multi-paradigm languages can
solve problems in more application areas and with greater flexibility than single-paradigm
languages.

CONCLUSION:
The realm of High Performance Computing has seen major advancements thanks to the unified
architecture. Graphics accelerators are no longer merely tools for processing graphics. G80
and CUDA are two promising implementations of the unified architecture.

CUDA has made the development of general-purpose parallel applications easier. These
applications now have sufficient computational capacity to produce accurate results in a
reasonable amount of time. We may anticipate further improvement in this area in terms of
standardization and the power of graphics accelerators, as well as advances in applications
that use graphics accelerators as a computing tool.
