
Automatic Parallelization using Open MP Directives

International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, Volume-8, Issue-11, September 2019


Deepika Dash, Anala M R
Dept. of CSE, RV College of Engineering (RVCE), Bengaluru, India

Abstract: With the advent of parallel computing, it has become necessary to write OpenMP programs to achieve better speedup and to exploit parallel hardware efficiently. To do so, however, programmers must understand OpenMP directives and clauses, the dependencies in their code, and so on. A small mistake, such as a wrongly analysed dependency or incorrect data scoping of a variable, can result in an incorrect or inefficient program. In this paper, we propose a system which automates the parallelization of serial C code. The system accepts a serial program as input and generates the corresponding parallel OpenMP code without altering the core logic of the program. It uses the data scoping and work-sharing constructs available on the OpenMP platform. The system aims at parallelizing "for" loops, "while" loops, nested "for" loops and recursive structures. "For" loops are parallelized by considering the induction variable, and "while" loops are converted to "for" loops for parallelization. The system is tested with several programs, such as matrix addition, quick sort and linear search, as input. The execution time of the programs before and after parallelization is determined, and a graph is plotted to visualize the decrease in execution time.

Index Terms: Automatic parallelization tool, collapse, OpenMP, OpenMP directives and clauses, pragma directives, parallel computing, recursive structures, task, taskwait

I. INTRODUCTION

Parallel computing has become increasingly popular by virtue of the magnitude of benefits it offers. Its areas of application span several fields of science, medicine and engineering, such as web search engines, medical imaging and diagnosis, and multiple areas of mathematics. The High-Performance Computing (HPC) market is estimated to grow from USD 28.08 billion in 2015 to USD 36.62 billion by 2020, at a Compound Annual Growth Rate (CAGR) of 5.45% during the forecast period [1]. Parallel computing involves breaking down large problems into smaller sub-problems which may be solved in parallel. This approach exploits the parallel hardware of modern computers and hence enables faster computation. While this may not make a difference for simple programs, it saves considerable time, power and cost when dealing with very large, complex problems involving big data or heavy computation. Many systems are implemented as serial programs, and converting them to parallel programs gives them an edge, since parallelization offers multiple benefits. Developing parallel code for a complex serial program with numerous dependencies is not an easy task: it requires a thorough understanding of the OpenMP platform, manual parallelization is often prone to errors, and debugging the code becomes tedious. The system developed here addresses these problems by providing a tool which automates the process of parallelization. Automatic parallelization abstracts the parallel computing platform away from developers, thereby eliminating the need for the developer to know the details of that platform [2].
There are several parallel computing platforms, such as OpenMP, CUDA and MPI. The system developed here uses OpenMP as it is easy to use and portable. The primary function of the system is to generate an accurate and efficient OpenMP program for a given sequential C code. To accomplish this, the system is organised into three main modules: the Custom Parser, the OpenMP Analyzer and the Code Generator. The first step is parsing the code: the parser checks whether the input program is syntactically correct and returns error messages, if any, to the user. The parser also populates the data structures, namely the variable table and the statement and function details tables, which are used by the OpenMP Analyzer. The OpenMP Analyzer detects blocks of code with potential for parallelization, such as recursions, "for" loops and "while" loops, and generates the dependency graph. The Code Generator adds the directives and clauses to produce the parallel program as the final output. The generated parallel codes are checked for correctness and speedup, and the results obtained are plotted on a graph.

The rest of the paper is organized as follows. Section II presents a brief description of tasks, data scoping and the collapse clause. Section III describes the methodology by providing algorithms for the various modules. Sections IV and V discuss related work and experimental results respectively. Section VI concludes the paper, and Section VII describes future enhancements.

II. BACKGROUND

OpenMP specification version 3.0 introduced a new feature called tasking. Tasking facilitates the parallelization of applications where units of work are generated dynamically, as in recursive structures [3]. The data scoping attribute clauses shared and firstprivate are also used. A brief description of these clauses is given below:

- Shared: the shared clause declares the variables in the list to be shared among all the threads in a team. All threads within a team access the same storage area for shared variables [4].
- Firstprivate: the firstprivate clause provides a superset of the functionality provided by the private clause. The private copy of the variable is initialized with the original value of the variable when the parallel construct is encountered [4].

The collapse clause is used to parallelize nested "for" loops. It is applied only when the loops satisfy perfect nesting and have a rectangular iteration space. OpenMP also provides an atomic directive, which specifies that the next statement must be executed by one thread at a time [5]. The omp atomic directive allows a specific memory location to be accessed atomically; this ensures that race conditions are avoided through direct control of the concurrent threads that might read from or write to that memory location [6].
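To make these clauses concrete, the following minimal C sketch (our own illustration, not output produced by the system) combines shared, firstprivate, collapse and atomic in one perfectly nested loop over square matrices of size N:

#include <stdio.h>
#define N 256

int a[N][N], b[N][N], c[N][N];

int main(void) {
    long checksum = 0;   /* shared: a single copy, so updates below are made atomic */
    int  bias     = 1;   /* firstprivate: each thread gets its own copy, already    */
                         /* initialised to 1                                        */

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { a[i][j] = i; b[i][j] = j; }

    /* The two loops are perfectly nested (nothing between the headers) and the
     * iteration space is rectangular, so collapse(2) merges them into a single
     * parallel iteration space.                                                 */
    #pragma omp parallel for collapse(2) shared(checksum) firstprivate(bias)
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            c[i][j] = a[i][j] + b[i][j] + bias;
            #pragma omp atomic            /* one thread at a time updates checksum */
            checksum += c[i][j];
        }
    }

    printf("checksum = %ld\n", checksum);
    return 0;
}

Compiled with an OpenMP-enabled compiler (e.g. gcc -fopenmp), the checksum is deterministic despite the concurrent updates, because the atomic directive serialises them.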
III. METHODOLOGY

The system enables the user to parallelize complex serial code involving difficult and time-consuming computations. The user only needs to provide the serial C code as input. The system parallelizes the input code by identifying the data and control dependencies and the blocks of code that can be parallelized, such as recursions and loops. These blocks are then parallelized by inserting OpenMP clauses and directives as needed. The steps involved in the implementation of the various modules are discussed below.

3.1 Parallelization of nested loops using the collapse clause

"For" loop parallelization is achieved through the directive "#pragma omp parallel for". Nested "for" loops can also be parallelized by adding the collapse clause to the pragma directive of the outermost "for" loop; however, this is beneficial only for certain kinds of nested loops, and in some situations loops perform better without nested parallelization. It is therefore necessary to understand the conditions under which nested loop parallelism is beneficial. It has been found that perfect nesting and a rectangular iteration space make nested loops suitable candidates for the collapse clause. Loops are said to be perfectly nested if there are no lines of code between the "for" loop headers. An example of such code is matrix addition; the C code for matrix multiplication can also be modified to satisfy the criteria. Such programs show considerable improvement in performance when the collapse clause is used. It is the responsibility of the system to identify such loops and add the clause accordingly.

The first step is to detect nested "for" loops and check whether they satisfy perfect nesting and a rectangular iteration space. A loop is checked for perfect nesting by setting a flag to '1' if the body of the outer "for" loop consists of any statement apart from the inner "for" loop. Each time the nesting condition is satisfied, a variable named nest_level is incremented. Qualifying nested "for" loops are parallelized by appending the collapse clause to the outermost "for" loop's pragma directive (of the form shown in the sketch at the end of Section II).

Algorithm 1: Parallelization of nested loops
Input: Sequential C code with nested loops
Output: C code with parallelized nested loops
1: procedure CollapseClauseGenerator
2:   if nested "for" loops present then
3:     if loops are perfectly nested then
4:       if rectangular iteration space flag (variable) == 1 then
5:         Determine level of nesting
6:         Append collapse clause to "for" loop's pragma directive
7:       Endif
8:     Endif
9:   Endif
10: end procedure CollapseClauseGenerator

3.2 Conversion of "while" loops into equivalent "for" loops

The WhileToForConverter module converts "while" loops that are suitable for conversion into equivalent "for" loops by identifying the iteration variable, from which the initialization and increment/decrement expressions are generated. This step is carried out during the parsing stage itself. The converted loops are then parallelized as regular "for" loops.

Algorithm 2: Conversion of "while" loops
Input: Sequential C code with "while" loops
Output: C code with equivalent "for" loops
1: procedure WhileToForConverter
2:   Determine iteration variable
3:   Generate initialization expression
4:   Generate increment/decrement expression
5:   Replace "while" header with "for" header
6:   Remove increment/decrement expression from body of converted loop
7: end procedure WhileToForConverter

The module first identifies "while" loops that have a single condition and are not nested within other loops. The iteration variable is determined for the loops which satisfy these conditions. During the parsing stage, variable names and their values are stored in a variable table. Once the iteration variable is identified from the "while" loop header, its value is obtained from the variable table; this data is used to generate the initialization expression. The increment/decrement expression is generated by identifying the relational operator in the condition expression of the "while" loop. A sketch of such a conversion follows.
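The sketch below is our own hand-written illustration of the before/after shapes this conversion produces (it is not output captured from the tool); the linear-search shape matches one of the paper's test programs, and the atomic write is added here only to keep the parallel version race-free:

#include <stdio.h>

#define N 100000

int main(void) {
    static int a[N];
    int key = 12345, found = 0;
    for (int i = 0; i < N; i++) a[i] = i;

    /* Before conversion: a "while" loop with a single condition that is not
     * nested inside another loop.  The iteration variable is i, its initial
     * value comes from the variable table, and the increment sits in the body:
     *
     *     int i = 0;
     *     while (i < N) {
     *         if (a[i] == key) found = 1;
     *         i++;
     *     }
     *
     * After conversion, the same loop becomes a regular "for" loop, which can
     * then be parallelized like any other "for" loop:                         */
    #pragma omp parallel for shared(a, key, found)
    for (int i = 0; i < N; i++) {
        if (a[i] == key) {
            #pragma omp atomic write   /* avoid a race if several threads match */
            found = 1;
        }
    }

    printf("found = %d\n", found);
    return 0;
}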
3.3 Parallelization of recursions using the task directive

Two modules implement the parallelization of recursions. They identify recursive structures of code to parallelize using tasks, analyse them for shared and firstprivate variables by applying task scoping rules, and add the taskwait directive where required. The RecursionRecognizer module makes use of the statement table, which stores details such as the statement type, nesting level, etc. The statement type indicates whether a particular statement is a function call, expression, declaration, etc. If the statement type is "function call", the module checks whether the function is recursive and accordingly sets a flag in the function details table.

Algorithm 3: Detection of recursions
Input: Statement table
Output: Updated function details table
1: procedure RecursionRecognizer
2:   Read statement table
3:   Detect function calls
4:   if function is recursive then
5:     Set RecursionBit (variable) to 1
6:   Endif
7: end procedure RecursionRecognizer

The TaskDirectiveGenerator module adds the task construct above the lines of code indicated by the RecursionRecognizer module. This module makes use of the statement and function details tables, which are updated by the RecursionRecognizer.

Algorithm 4: Generation of task directive
Input: Statement table and function details table
Output: C code with parallelized recursions
1: procedure TaskDirectiveGenerator
2:   Read statement and function details tables
3:   if RecursionBit (variable) == 1 then
4:     Scope variables
5:     Insert task directive
6:   Endif
7:   if value returned by recursive function call is used in following lines then
8:     Insert taskwait directive
9:   Endif
10: end procedure TaskDirectiveGenerator

The module checks whether the given function call has its recursion flag set to 1. If so, it adds the task directive above the recursive calls; the data scope of the variables is then determined and appended to the task directive. If the function's return type is not void, a taskwait construct is added after the last recursive call. A sketch of the resulting code shape is given below.
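The shape of code this produces can be pictured with a recursive array sum, one of the paper's test cases ("parallel sum"). The listing below is our own sketch under that assumption; the function name rsum and the sequential cutoff of 1000 elements are illustrative, not the tool's output:

#include <stdio.h>

/* Recursive sum over a[lo..hi), parallelized with OpenMP tasks. */
long rsum(const int *a, int lo, int hi) {
    if (hi - lo < 1000) {              /* small ranges: stay sequential */
        long s = 0;
        for (int i = lo; i < hi; i++) s += a[i];
        return s;
    }
    int mid = lo + (hi - lo) / 2;
    long left, right;

    #pragma omp task shared(left) firstprivate(a, lo, mid)
    left = rsum(a, lo, mid);           /* result written to the shared 'left'  */

    #pragma omp task shared(right) firstprivate(a, mid, hi)
    right = rsum(a, mid, hi);

    #pragma omp taskwait               /* the return values are used below,    */
    return left + right;               /* so wait for both child tasks first   */
}

int main(void) {
    static int a[1000000];
    for (int i = 0; i < 1000000; i++) a[i] = 1;

    long total;
    #pragma omp parallel               /* create the thread team once ...      */
    #pragma omp single                 /* ... let one thread start the         */
    total = rsum(a, 0, 1000000);       /* recursion; tasks spread the work     */

    printf("total = %ld\n", total);
    return 0;
}

Because the recursive calls' return values are used after the calls, a taskwait is required before the addition, which matches the rule in Algorithm 4.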
IV. RELATED WORK

Par4All and Cetus do not support parallelization of recursive structures; such structures may have to be replaced by "for" loops to enable parallelization. The system developed here is capable of parallelizing recursive functions through the use of the task and taskwait directives. This is checked by providing C programs such as quick sort, merge sort, etc. as input; in the generated output programs, the recursive functions are parallelized by inserting the task directive at the appropriate lines. Pluto, Cetus and Par4All cannot handle parallelization of "while" loops, which therefore need to be converted to equivalent "for" loops. The system developed here automatically converts suitable "while" loops into equivalent "for" loops, which are then parallelized. The Intel C++ and Fortran compilers are able to analyse the dataflow in loops to determine which loops can be safely and efficiently executed in parallel [7]; they also perform a check on the code to determine the need for parallelization. Synchronisation and loop scheduling can both be significant sources of overhead in shared memory parallel programs [8].

V. RESULTS AND DISCUSSION

The system converts input serial C code to corresponding parallel code by adding OpenMP clauses and directives. The correctness of the output program generated by the system, however, needs to be verified; this is achieved by comparing the output produced by the serial and parallel codes. A set of sample programs was chosen for each module. The first set, for the collapse clause, includes matrix addition and matrix multiplication. The second set, for while-to-for conversion, consists of programs such as string palindrome check and linear search. The third set, for the task directive, deals with recursive programs such as quick sort, merge sort and parallel sum.

[Fig. 1. Performance comparison: execution time (in seconds) of the serial and parallel versions of the string palindrome check, linear search, matrix multiplication and quick sort programs.]

The execution times of the input serial programs and the output parallel programs were determined, and a graph was plotted to analyse the results. The graph in Fig. 1 shows that the execution time of the parallel programs generated by the system is much lower than that of their serial counterparts.
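The paper does not state how the timings in Fig. 1 were collected; a common approach, assumed in this sketch, is to wrap each run in calls to omp_get_wtime() and report the wall-clock difference (run_benchmark here is a hypothetical stand-in for any of the test programs):

#include <stdio.h>
#include <omp.h>

/* Hypothetical work function standing in for one of the benchmark programs. */
static void run_benchmark(void) {
    double s = 0.0;
    #pragma omp parallel for reduction(+ : s)
    for (int i = 0; i < 50000000; i++)
        s += (double)i * 0.5;
    printf("checksum = %f\n", s);
}

int main(void) {
    double t0 = omp_get_wtime();   /* wall-clock time before the run */
    run_benchmark();
    double t1 = omp_get_wtime();   /* wall-clock time after the run  */
    printf("execution time = %.6f s\n", t1 - t0);
    return 0;
}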
VI. CONCLUSION

The aim of the system, to automate the parallelization of C code, was achieved. The system was tested by running several programs for each module, e.g. matrix multiplication and matrix addition for testing nested loop parallelism, and quick sort, merge sort, etc. for testing parallelization of recursive structures. The generated parallel codes were checked for correctness. A sample program consisting of a set of serial codes, namely matrix multiplication, Monte Carlo pi calculation, linear search and recursive sum, was run through the system to generate parallel code. The serial code took 13.16 s to execute while the parallel code took 7.73 s for the same input size, showing that the parallel codes generated by the system perform computations faster. The system helps to eliminate errors that may arise from manual parallelization. The correctness of the parallel programs was verified by comparing the outputs generated before and after parallelization, and the speedup was demonstrated by comparing execution times for large input sizes.

The system, however, has certain shortcomings. OpenMP provides a multitude of clauses and directives, but only a subset of them has been implemented here. "While" loops are parallelized by converting them to equivalent "for" loops; this method is not suitable in situations where "while" loops are preferred over "for" loops, and such "while" loops need to be parallelized using different techniques (e.g. the task directive). The user is also required to have a basic understanding of OpenMP in order to know when it is beneficial to parallelize a serial code. For example, a serial code which works on small input sizes will run faster than its parallel counterpart because of the overhead incurred by parallelization (forks and joins); the fork/join of OpenMP threads consumes extra resources, and this overhead increases with the number of OpenMP threads used [9]. Finally, the system accepts only C code as input and converts it into parallel code using OpenMP directives. There are alternatives to OpenMP, such as CUDA and MPI. CUDA is a parallel computing platform and programming model invented by NVIDIA [10]; it enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU) [10]. MPI is preferred when running a program on clusters. While this system provides a suitable platform for generating OpenMP programs for shared-memory multiprocessors, it may not be the preferred choice for a user whose application demands a CUDA or MPI program.

VII. FUTURE ENHANCEMENTS

The project can be further enhanced by adding features such as:

- Schedule clause: OpenMP offers the schedule clause, with a set of predefined iteration scheduling strategies, to specify how (and when) the assignment of iterations to threads is done [11]. The system should be able to identify the best-suited scheduling technique for a given input program (a sketch of the clause follows this list).
- Synchronization constructs: two statements that are not determined to be concurrent cannot execute in parallel in any execution of the program; if two statements are determined to be concurrent, they may execute concurrently [12]. Some of these directives, such as critical and taskwait, have already been implemented, but there is still scope to incorporate other synchronization directives such as master, barrier and flush.
- Task construct: the task directive has been implemented for handling recursions, but it is also possible to parallelize "while" loops using tasks [3]. Adding this feature will help parallelize the "while" loops which cannot or should not be converted to "for" loops.
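As an indication of what the schedule-clause enhancement could look like (our own sketch, not a feature the system currently emits; schedule(dynamic, 64) is just one of the standard OpenMP strategies, chosen here for illustration):

#include <stdio.h>

#define N 10000

int main(void) {
    static double work[N];

    /* Iterations with uneven cost often benefit from dynamic scheduling:
     * chunks of 64 iterations are handed out to threads as they finish,
     * instead of the default static partitioning.                        */
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < N; i++) {
        double x = 0.0;
        for (int j = 0; j < i; j++)     /* cost grows with i: load imbalance */
            x += (double)j;
        work[i] = x;
    }

    printf("work[N-1] = %f\n", work[N - 1]);
    return 0;
}

An automatic system would have to estimate per-iteration cost variation before choosing between static, dynamic and guided schedules, which is why this is listed as future work.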
ACKNOWLEDGEMENT

We express our gratitude towards Amit G Bhat and Meghana N Babu, Dept. of CSE, R.V. College of Engineering, for providing insight and expertise that greatly assisted the project. We also thank our parents for instilling the confidence to complete this project. Lastly, we thank all the faculty members of R.V. College of Engineering for their constant support and encouragement.

REFERENCES

1. MarketsandMarkets, "High Performance Computing Market by Components Type (Servers, Storage, Networking Devices, & Software), Services, Deployment Type, Server Price Band, Vertical, & Region - Global Forecast to 2020," Rep. TC 2204, Feb. 2016.
2. A. G. Bhat, Meghana N Babu, Anala M R, "Towards Automatic Parallelization of 'for' loops," in IEEE International Advance Computing Conference (IACC), Bangalore, 2015, pp. 136-142.
3. Oracle, "Sun Studio 12 Update 1: OpenMP API User's Guide." [Online]. Available: https://docs.oracle.com/cd/E19205-01/820-7883/6nj43o69j/index.html
4. IBM Knowledge Center, "Shared and private variables in a parallel environment." [Online]. Available: https://www.ibm.com/support/knowledgecenter/SSLTBW_2.2.0/com.ibm.zos.v2r2.cbcpx01/cuppvars.htm
5. "OpenMP Synchronization," 26 July 2016. [Online]. Available: http://cs.umw.edu/~finlayson/class/fall16/cpsc425/notes/13-openmp-sync.html
6. IBM Knowledge Center, "#pragma omp atomic - purpose." [Online]. Available: https://www.ibm.com/support/knowledgecenter/SSGH2K_13.1.0/com.ibm.xlc131.aix.doc/compiler_ref/prag_omp_atomic.html
7. Intel Developer Zone, "Automatic Parallelization with Intel Compilers," 2 Nov. 2011. [Online]. Available: https://software.intel.com/en-us/articles/automatic-parallelization-with-intel-compilers
8. J. M. Bull, "Measuring Synchronisation and Scheduling Overheads in OpenMP," in Proceedings of the First European Workshop on OpenMP, 1999, pp. 99-105.
9. W. Zhang et al., High Performance Computing and Applications: Second International Conference, HPCA 2009, Shanghai, China, August 10-12, 2009, Revised Selected Papers. Berlin, Heidelberg: Springer, 2010.
10. NVIDIA, "CUDA Parallel Computing Platform." [Online]. Available: http://www.nvidia.com/object/cuda_home_new.html
11. O. Hernandez et al., "Performance Instrumentation and Compiler Optimizations for MPI/OpenMP Applications," in OpenMP Shared Memory Parallel Programming: International Workshops, IWOMP 2005 and IWOMP 2006, Eugene, OR, USA, June 1-4, 2005, Reims, France, June 12-15, 2006, Proceedings. Berlin, Heidelberg: Springer, 2008, pp. 267-278.
12. Y. Zhang et al., "Concurrency Analysis for Shared Memory Programs with Textually Unaligned Barriers," in Languages and Compilers for Parallel Computing: 20th International Workshop, LCPC 2007, Urbana, IL, USA, October 11-13, 2007, Revised Selected Papers. Berlin, Heidelberg: Springer, 2008, pp. 95-109.
13. Anala M R, Deepika Dash, "Framework for Automatic Parallelization," in 25th International Conference on High Performance Computing Workshops (HiPCW), Bangalore, IEEE, 2018, ISBN 978-1-7281-0114-9.