Parallel & Distributed Computing Report
Report:
(CUDA)
Group members:
Introduction:
Tesla devices were designed with general-purpose computing in mind, and the CUDA architecture built on them has greatly simplified the development of general-purpose applications on GPUs. Because this technology is inexpensive, graphics accelerators are available in ordinary desktop PCs, and we may expect widespread adoption of the CUDA architecture among developers.
CUDA ARCHITECTURE:
CUDA can be easily incorporated into existing IT infrastructure. It ships with a software development kit and a C compiler embedded in a familiar environment. Under the Linux and Windows operating systems, this solution is compatible with AMD/Intel x86 and x64 microprocessors. The graphics accelerator is connected to the host machine over PCI Express x16, an interconnect with a bandwidth of 8 GB/s (4 GB/s upstream and 4 GB/s downstream).
CUDA broadens the scope of general-purpose computing on the GPU. In this model, the GPU acts as a parallel coprocessor: it processes massive amounts of data simultaneously, while the CPU manages, presents, and orchestrates the computing activity. Complex datasets are broken down into smaller, independent pieces and processed with the same instructions; in other words, the data is processed in parallel. CUDA consists of a few basic components responsible for this simultaneous data processing: the CUDA driver, the CUDA API, and the CUDA mathematics libraries (CUBLAS and CUFFT). The C compiler greatly facilitates the building of parallel applications, so developers can devote their full attention to the application itself.
They do not need to reformulate a problem so that it can be expressed through the graphics API. CPU and GPU code are written as a single unit; language extensions tell the compiler which parts should be handled by the GPU and which by the CPU. The code is first compiled with the CUDA GPU compiler and then with the regular C compiler. This programming paradigm lets the same code run on many versions of the graphics accelerator; the number of stream processors in use is irrelevant to the programmer. This is how a GPU programme is realized: the programme runs on the host machine, the CUDA driver launches the device code on the GPU automatically, and the CPU part of the software communicates with the GPU part over a high-speed interface.
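A minimal sketch of this single-source model, with illustrative names (the kernel scale, the array size n, and the block size of 256 are assumptions, not taken from the report): the __global__ qualifier marks the function compiled for the GPU, while the surrounding host code is compiled by the regular C compiler.

#include <cuda_runtime.h>

// Device code: the CUDA compiler turns this into GPU code.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        data[i] *= factor;                           // each thread handles one element
}

// Host code: compiled by the regular C compiler and run on the CPU.
int main(void)
{
    const int n = 1024;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));          // DEVICE global memory

    // The CPU only configures and launches the work; the driver schedules
    // the kernel on however many stream processors the accelerator has.
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}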
Special driver operations enable communication between the CPU and the GPU, so the programmer is not responsible for managing the computing resources. Applications that require parallel computational capability can benefit from CUDA paired with a massively parallel GPU.
A block is a collection of threads that can cooperate efficiently and share data through fast shared memory. All threads in the same block execute the same kernel, and each thread in the block has a unique identifier, i.e. the ID gives the thread's position within the block. For thread addressing, the block can be treated as a 2D or 3D array: in a 2D block of size (Dx, Dy), the thread at position (x, y) has ID x + y*Dx, and in a 3D block of size (Dx, Dy, Dz), the thread at (x, y, z) has ID x + y*Dx + z*Dx*Dy.
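A short sketch of how that identifier can be computed inside a kernel; the kernel name thread_ids and the output array are illustrative, and the sketch assumes a single 2D block.

__global__ void thread_ids(int *out)
{
    // Position (x, y) of this thread inside a 2D block of size (Dx, Dy).
    int x  = threadIdx.x;
    int y  = threadIdx.y;
    int Dx = blockDim.x;

    // Linear thread ID within the block: x + y*Dx
    // (for a 3D block this would become x + y*Dx + z*Dx*Dy).
    int tid = x + y * Dx;

    out[tid] = tid;   // assumes a single block, with one output slot per thread
}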
Per-thread operations use local memory and registers. Shared memory is used for communication between the threads of a block, while communication between blocks and between grids goes through global memory.
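A hedged sketch of this memory hierarchy in use, assuming a block size of 256 threads (a power of two) and illustrative names: the threads of a block exchange partial sums through shared memory, and each block writes its result to global memory so the per-block results can be combined later.

__global__ void block_sum(const float *in, float *block_results)
{
    // Shared memory: visible to all threads of the same block and used
    // here for thread-to-thread communication.
    __shared__ float buf[256];                       // assumes blockDim.x == 256

    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];    // each thread loads one element
    __syncthreads();                                 // wait until the whole block has written

    // Tree reduction inside the block, communicating through shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();
    }

    // Communication between blocks goes through global memory:
    // one partial result per block is written back.
    if (tid == 0)
        block_results[blockIdx.x] = buf[0];
}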
HOST memory is not directly accessible to GPU threads. If a kernel needs such data, it must first be copied into the DEVICE's global memory; transfers in the reverse direction are also possible. CUDA provides a shared memory model in which read and write operations on the DEVICE look the same as those on the HOST. The shared memory approach allows faster memory operations and more efficient communication between threads, so the application becomes less dependent on DRAM bandwidth.
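A minimal sketch of these HOST/DEVICE transfers through the CUDA runtime API; the array size and variable names are illustrative.

#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes);   // HOST memory, not visible to GPU threads
    float *d_data;
    cudaMalloc(&d_data, bytes);               // DEVICE global memory

    // Copy HOST -> DEVICE so that kernels can read the data ...
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // ... (kernel launches would go here) ...

    // ... and copy DEVICE -> HOST to bring the results back.
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    free(h_data);
    return 0;
}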
Every clock cycle, the multiprocessor executes the same instruction on different data. Each processor has access to several types of memory: registers, shared memory, a read-only constant cache, a read-only texture cache, and the device's global memory.
The total computation can be split into multiple parallel stages. A number of issues can arise, including unifying the platform across all the integrated hardware and developing algorithms suited to computation on this hardware platform. The latency of memory operations between the GPU and the CPU is the main limitation on this acceleration.
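One way to quantify that limitation is to time the transfers against the kernel itself. The sketch below uses CUDA events for this; the placeholder kernel work and the array size are assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void work(float *d, int n)                  // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] = d[i] * 2.0f + 1.0f;
}

int main(void)
{
    const int n = 1 << 22;
    size_t bytes = n * sizeof(float);
    float *h = (float *)calloc(n, sizeof(float));
    float *d;
    cudaMalloc(&d, bytes);

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventCreate(&t2);

    cudaEventRecord(t0);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // communication
    cudaEventRecord(t1);
    work<<<(n + 255) / 256, 256>>>(d, n);              // computation
    cudaEventRecord(t2);
    cudaEventSynchronize(t2);

    float copy_ms, kernel_ms;
    cudaEventElapsedTime(&copy_ms, t0, t1);
    cudaEventElapsedTime(&kernel_ms, t1, t2);
    printf("copy: %.3f ms, kernel: %.3f ms\n", copy_ms, kernel_ms);

    cudaFree(d);
    free(h);
    return 0;
}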
When searching for appropriate algorithms, we must measure the time spent on computation and on communication, and then decide which parts should be offloaded to the GPU rather than run on the CPU. A multi-paradigm approach is worth considering in this setting: a multi-paradigm language lets the user draw on more programming methods, simplifies the syntax of the language, and expands its application areas through extended semantics. As a result, multi-paradigm languages can solve problems in more application areas and with greater flexibility than single-paradigm languages.
CONCLUSION:
The realm of high-performance computing has seen major advances thanks to the unified architecture. Graphics accelerators are no longer merely tools for processing graphics; G80 and CUDA are two promising implementations of the unified architecture.
CUDA has made the development of general-purpose parallel applications easier, and these applications now have sufficient computational capacity to produce accurate results in a reasonable amount of time. We may anticipate further improvement in this area, both in standardization and in the power of graphics accelerators, as well as advances in applications that use graphics accelerators as a computing tool.