Striving for Efficiency in Algorithms: Sorting

Inna Pivkina∗

Sorting is a fundamental algorithmic problem in computer science. It is the first step in solving many other algorithmic problems. Donald Knuth, a world-famous computer scientist and author of the book “The Art of Computer Programming, Volume 3: Sorting and Searching” ([6]), wrote: “I believe that virtually every important aspect of programming arises somewhere in the context of searching or sorting”.

Quicksort is a comparison sorting algorithm that, on average, makes O(n log n) comparisons to sort n items. Asymptotically, this is as efficient as a comparison sorting algorithm can be ([1]). Quicksort is often faster in practice than other O(n log n) sorting algorithms, and it has another advantage: it sorts in place, that is, the items are rearranged within the array, so it does not require a lot of additional space ([1]).

Quicksort was invented by a British computer scientist, C.A.R. Hoare, in 1960. Sir Charles Antony Richard Hoare describes how he invented Quicksort in his interview published in [10]. After graduating from the University of Oxford in 1956, Hoare did his national service in the Royal Navy, studying Russian. In 1958 he took a course in Mercury Autocode, which was the programming language used on a computer at Oxford University. Later, he was a visiting student at Moscow State University in the Soviet Union for a year. That is when he developed the Quicksort algorithm. The following is a quote from the interview with Hoare ([10]), where he describes his invention of Quicksort:

“The National Physical Laboratory was starting a project for the automatic translation of Russian into English, and they offered me a job. I met several of the people in Moscow who were working on machine translation, and I wrote my first published article, in Russian, in a journal called Machine Translation. In those days the dictionary in which you had to look up in order to translate from Russian to English was stored on a long magnetic tape in alphabetical order. Therefore it paid to sort the words of the sentence into the same alphabetical order before consulting the dictionary, so that you could look up all the words in the sentence on a single pass of the magnetic tape. I thought with my knowledge of Mercury Autocode, I’ll be able to think up how I would conduct this preliminary sort. After a few moments I thought of the obvious algorithm, which is now called bubble sort, and rejected that because it was obviously rather slow. I thought of Quicksort as the second thing. It didn’t occur to me that this was anything very difficult. It was all an interesting exercise in programming. I think Quicksort is the only really interesting algorithm that I ever developed.”

Hoare described the algorithm in his papers in 1961 and 1962 ([2], [3], [4], [5]). After its invention by Hoare, Quicksort underwent extensive analysis by Robert Sedgewick in 1975, 1977, and 1978 ([7], [8], [9]). In his paper “Implementing Quicksort programs” ([9]), Sedgewick presented “a practical study of how to implement the Quicksort sorting algorithm and its best variants on real computers”. The paper contains the original version of Quicksort and presents step-by-step modifications to the algorithm which, as Sedgewick says, make its implementation on real computers more efficient.

∗ Department of Computer Science; New Mexico State University; Las Cruces, NM 88003; [email protected].

The main idea of the project is to verify experimentally whether Sedgewick's results in [9] still hold today, by implementing the algorithms and comparing their running times. The paper is attached at the end of the project. The project is divided into several parts. Each part contains a reading assignment and a list of tasks.
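Comparing running times, as the project proposes, requires timing only the sorting step itself. The following is a minimal Python sketch of one way to do that. The choice of Python is an assumption, since the project does not prescribe a language; the helper name time_sort is hypothetical; and Python's built-in sorted is only a stand-in for the Quicksort variants under study.

```python
import random
import time


def time_sort(sort_fn, data):
    """Return (sorted_list, elapsed_seconds) for one run of sort_fn.

    time_sort is a hypothetical helper name. Only the call to sort_fn is
    timed; copying the input happens before the clock starts.
    """
    work = list(data)                  # copy so every run sees the same input
    start = time.perf_counter()
    result = sort_fn(work)
    elapsed = time.perf_counter() - start
    # sort_fn may sort in place and return None; then the copy holds the result
    return (work if result is None else result), elapsed


if __name__ == "__main__":
    rng = random.Random(0)             # fixed seed so the illustration is repeatable
    data = [rng.randint(1, 10000) for _ in range(10000)]
    # `sorted` is a stand-in (an assumption) for the sorting routine being studied
    _, seconds = time_sort(sorted, data)
    print(f"sorted {len(data)} integers in {seconds:.6f} s")
```

Reading and writing files would happen outside time_sort, so that input and output costs stay out of the measurement.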
1 Project Part 1

Read the paper until section “Worst Case” on page 850 (read sections Introduction, The Algorithm, Improvements, Removing Recursion, Small Subfiles).

Exercise 1.1. Program 1 in the paper describes the original version of Quicksort. The operators and constructs used in Program 1 differ from those used in the textbook (Cormen et al. [1]), for instance, the operator :=: and the control structure loop ... repeat. Rewrite Program 1 using pseudocode conventions from the textbook.

Exercise 1.2. The partitioning algorithm used in Program 1 is different from the version of PARTITION given in the textbook. Write pseudocode for a procedure Partition1 which performs partitioning using the partitioning algorithm from Program 1. Use pseudocode conventions from the textbook.

Exercise 1.3. Demonstrate the operation of Partition1 from exercise 1.2 on the array A = ⟨11, 12, 4, 8, 15, 9, 7, 1, 6, 16⟩. Show the values of the array and the auxiliary values (values of i and j) after each step. (You may use Figure 1 on page 849 of the paper as an example of showing the values of the array.)

Exercise 1.4. Demonstrate the operation of Program 1 (quicksorting) on the array A from exercise 1.3. You may use Figure 2 on page 849 of the paper as an example.

Exercise 1.5. Demonstrate the operation of Partition1 from exercise 1.2 on the array consisting of equal numbers A = ⟨3, 3, 3, 3, 3, 3⟩. Show the values of the array and the auxiliary values (values of i and j) after each step. (You may use Figure 1 on page 849 of the paper as an example of showing the values of the array.) Notice that this example fits the description of the partitioning process from the last paragraph of the first column on page 848 of the paper, which starts with the words “If equal keys are present among A[1], . . . , A[N] . . . ”.

Exercise 1.6. Give an example of an array A such that:
• A has 5 elements, and
• exactly 2 out of these 5 elements are equal, and
• using Partition1 on A fits the description of the partitioning process from the last paragraph of the first column on page 848 of the paper, which starts with the words “If equal keys are present among A[1], . . . , A[N] . . . ”.
Demonstrate the operation of Partition1 on this array A.

Exercise 1.7. On page 848 Sedgewick writes: “For example, rather than using the “sentinel” A[N+1] = ∞ we could use

    loop: i := i+1; while i ≤ N and A[i] < v repeat;

for the i pointer increment, but this would be far less efficient.” Explain why this would be far less efficient.

Exercise 1.8. What strategy does Sedgewick adopt on the question of how keys equal to the partitioning element should be treated?

Exercise 1.9. On page 849 Sedgewick says: “For example, if the file A[1], . . . , A[N] is already in order, then the program will invoke itself to recursive depth N”. Verify the statement by drawing the tree of recursive calls for the execution of Program 1 on an already sorted array A[1], . . . , A[N] (assume that all the elements being sorted are distinct).

Exercise 1.10. Rewrite procedure insertionsort on pages 849-850 of the paper using pseudocode conventions from the textbook. Notice that it is different from the insertion sort implementation in the textbook.

Exercise 1.11. Demonstrate the procedure insertionsort from exercise 1.10 on the array A = ⟨11, 12, 4, 8, 15, 9, 7, 1, 6, 16⟩. Show the contents of the array after each execution of the outer loop.

Exercise 1.12. On page 850 Sedgewick writes: “The obvious way to improve Program 1 is to change the first if statement to if r-l ≤ M then insertionsort(l,r) else . . . ”. Why is this an improvement?

Exercise 1.13. What is an even better way to modify Program 1 than the one in exercise 1.12?

Exercise 1.14. What is the best value of the length of unsorted subfiles?
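The change quoted in Exercise 1.12 can be sketched in Python as follows. This is an illustration under stated assumptions, not Sedgewick's Program 1: it uses 0-based Python lists, a common textbook Hoare-style partition rather than Partition1, and a plain insertion sort rather than the paper's insertionsort. The names insertion_sort, partition and quicksort_cutoff, and the parameter m (playing the role of M), are hypothetical.

```python
def insertion_sort(a, lo, hi):
    """Sort a[lo..hi] (inclusive bounds) in place by straight insertion."""
    for i in range(lo + 1, hi + 1):
        key = a[i]
        j = i - 1
        while j >= lo and a[j] > key:
            a[j + 1] = a[j]            # shift larger elements one slot right
            j -= 1
        a[j + 1] = key


def partition(a, lo, hi):
    """Hoare-style partition of a[lo..hi] around the value a[lo]; needs lo < hi.

    Returns j with lo <= j < hi such that every element of a[lo..j] is
    less than or equal to every element of a[j+1..hi].
    """
    pivot = a[lo]
    i, j = lo - 1, hi + 1
    while True:
        i += 1
        while a[i] < pivot:            # scan right for an element >= pivot
            i += 1
        j -= 1
        while a[j] > pivot:            # scan left for an element <= pivot
            j -= 1
        if i >= j:
            return j
        a[i], a[j] = a[j], a[i]


def quicksort_cutoff(a, lo, hi, m):
    """Sort a[lo..hi] in place; requires m >= 0.

    Subfiles with hi - lo <= m are handed to insertion sort instead of
    being partitioned further.
    """
    if hi - lo <= m:
        insertion_sort(a, lo, hi)
        return
    j = partition(a, lo, hi)
    quicksort_cutoff(a, lo, j, m)
    quicksort_cutoff(a, j + 1, hi, m)
```

A call such as quicksort_cutoff(a, 0, len(a) - 1, 9) sorts the whole list a. On an already sorted list this sketch recurses to a depth proportional to the list length (compare Exercise 1.9), so long sorted inputs can exceed Python's default recursion limit.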
2 Project Part 2

Read the paper until section “Assembly Language” on page 852 (read sections Worst Case, Median-of-three Modification, Implementation).

Exercise 2.1. What method did Hoare suggest to make the worst case unlikely to occur in practice? Explain in your own words why this method works.

Exercise 2.2. How did Sedgewick suggest modifying Program 1 in order to implement the method from exercise 2.1? When describing the modification, please use pseudocode conventions from the textbook.

Exercise 2.3. Describe in your own words the idea of the median-of-three modification. What is the purpose of this modification?

Exercise 2.4. What implementation of the median-of-three modification does Sedgewick present in the paper? When describing the implementation, please use pseudocode conventions from the textbook.

Exercise 2.5. Rewrite Program 2 using pseudocode conventions from the textbook.

Exercise 2.6. Explain why the condition A[N+1] = ∞ is needed in Program 2. What will happen if the condition is not there?

Exercise 2.7. Demonstrate the operation of Program 2 upon the digits of the constant e, that is, A = ⟨2, 7, 1, 8, 2, 8, 1, 8, 2, 8, 4, 5, 9, 0, 4, 5⟩. Use M=4. You may use Figure 6 on page 852 of the paper as an example. (This exercise is closely related to exercise 2.8.)

Exercise 2.8. What are the values of A_N, B_N, C_N, S_N, D_N, E_N in the execution of Program 2 on array A from exercise 2.7? (Give exact numbers.)

3 Project Part 3

Exercise 3.1. Implement Program 1 from the paper. Assume that the elements to be sorted are positive integers. Given a filename, the program should read the integers to be sorted from the file, sort them, and write the sorted integers to an output file. Your program should also compute how much time was spent on sorting. The computed time should not include time spent on reading data from and writing data to files.

Exercise 3.2. Implement Program 2 from the paper. Assume that the elements to be sorted are positive integers. Given a filename, the program should read the integers to be sorted from the file, sort them, and write the sorted integers to an output file. Your program should also compute how much time was spent on sorting. The computed time should not include time spent on reading data from and writing data to files.

Exercise 3.3. Write a separate program (let us call it Generator) which creates a file containing a specified number of random integers within a certain range. The inputs to this program should include the number of integers you want in the file and the lower and upper bounds for the integers. For example, it should be able to create a file with 1000 integers between 100 (inclusive) and 999 (inclusive).

Exercise 3.4. Generate 10 different input files using your Generator program. Each file should have ten thousand integers in the range from 1 to 10000.

Exercise 3.5. Sort each of the input files generated in exercise 3.4 using Program 1. Record the sorting times. What are the average, minimum and maximum sorting times?

Exercise 3.6. Sort each of the input files generated in exercise 3.4 using Program 2 with values of M equal to 3, 6, 9, 10, 14, 20. Record the sorting time in each run. What are the average, minimum and maximum sorting times for each of these values of M?

Exercise 3.7. Summarize your results in a table like the following:

Sorting time        Average   Minimum   Maximum
Program 1
Program 2, M=3
Program 2, M=6
Program 2, M=9
Program 2, M=10
Program 2, M=14
Program 2, M=20

Answer the following questions.
(a) Is the sorting time for Program 2 always better than the sorting time for Program 1 in your experiments?
(b) What value of M in your experiments gave the best sorting time? Is that value of M the same as the best value of M in Sedgewick’s paper?
(c) Is there a big difference between the average, minimum and maximum sorting times in your experiments?

Notes to the instructor

The project is designed for a junior-level Data Structures and Algorithms course. It is based on the paper by R. Sedgewick, “Implementing Quicksort Programs”. In the paper Sedgewick presents “a practical study of how to implement the Quicksort sorting algorithm and its best variants on real computers”. The paper contains the original version of Quicksort and its modification, which, as Sedgewick says, combines the most effective improvements to Quicksort. The main idea of the project is to experimentally verify the results of Sedgewick by implementing the algorithms and comparing their running times. The project allows students to learn and practice Quicksort, insertion sort, recursive thinking, using implicit stack data structure to remove recursion, computing running times of algorithms, etc. The project is divided into three parts. Each part contains a reading assignment and a list of tasks.

References

[1] T. Cormen, C. Leiserson, R. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, 2nd edition, 2002.
[2] C. A. R. Hoare. Algorithm 63: Partition. Communications of the ACM, 4(7):321, 1961.
[3] C. A. R. Hoare. Algorithm 64: Quicksort. Communications of the ACM, 4(7):321, 1961.
[4] C. A. R. Hoare. Algorithm 65: Find. Communications of the ACM, 4(7):321–322, 1961.
[5] C. A. R. Hoare. Quicksort. Computer Journal, 5(1):10–15, 1962.
[6] D. Knuth. The Art of Computer Programming: Sorting and Searching, volume 3. Addison-Wesley, 2nd edition, 1998.
[7] R. Sedgewick. Quicksort. PhD thesis, Stanford University, Stanford, CA, May 1975. Stanford Computer Science Report STAN-CS-75-492.
[8] R. Sedgewick. The analysis of quicksort programs. Acta Informatica, 7:327–355, 1977.
[9] R. Sedgewick. Implementing quicksort programs. Communications of the ACM, 21:847–857, 1978.
[10] L. Shustek. Interview: An interview with C.A.R. Hoare. Communications of the ACM, 52(3):38–41, March 2009.