Sorting
UC Berkeley 10/24/02
1 Hashing
While binary search trees provide logarithmic search times on average, we would like to do even better. For
integer-keyed items, we can trivially get constant search time by using an array, with one position for each
possible key. However, with 2^32 possible keys, there's no way we could create an array large enough.
Instead, we could get almost constant search time if we create an array twice as large as our data set,
and store a value k at position k mod m, where m is the size of the array. To search for a value k, we again
compute k mod m and then check the corresponding array position. We would expect an even distribution
of values, so we expect one or two values per array position. To retain this expectation, we may need to
resize the array as we insert values. We want to keep the load factor, k/m, at about 0.5. Resizing potentially
requires us to move all values, but the total amortized cost of insertion is still constant. Searching is also
constant in this scheme.
There is the problem of multiple values mapping to the same array position, or bucket. This situation
is called a collision. There are multiple ways of resolving collisions. One way is to place a value that maps
to an occupied bucket in the next empty bucket. Another is to keep a linked list at each bucket, and store
values in the lists. This is the solution we'll use.
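As a rough sketch of the chaining approach, here is a minimal hash table for integer keys. The class name is made up for illustration, and resizing is omitted for brevity; a real implementation would grow the table to keep the load factor near 0.5.

import java.util.LinkedList;

// A minimal chained hash table for int keys (resizing omitted).
class IntHashSet {
    private LinkedList<Integer>[] buckets;

    @SuppressWarnings("unchecked")
    public IntHashSet(int m) {
        buckets = new LinkedList[m];
        for (int i = 0; i < m; i++) {
            buckets[i] = new LinkedList<Integer>();
        }
    }

    // Map a key to a bucket index; adjust so negative keys
    // still produce a valid index.
    private int bucket(int k) {
        int i = k % buckets.length;
        return (i < 0) ? i + buckets.length : i;
    }

    public void insert(int k) {
        if (!contains(k)) {
            buckets[bucket(k)].add(k);
        }
    }

    public boolean contains(int k) {
        return buckets[bucket(k)].contains(k);
    }
}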
This is fine for integer values, but what about arbitrary objects? We first compute an integer corre-
sponding to the object, called a hash code, and use that to map the objects to positions. There are two
conditions on a hash code. Equal objects (as defined by the equals() method) must have equal hash codes,
and the hash code for an object cannot change while it is in a hash table. Then, assuming the hash code
can be computed quickly and distributes objects evenly in a table, this gives us constant-time searching for
arbitrary objects.
How should we compute a hash code on an object? Many objects can be represented as k-tuples of
integers. For example, an object corresponding to a point in two-dimensional space can be represented by a
pair of integers, obtained by converting the floating point x and y values bitwise to integers. A good hash code for a
k-tuple (x0 , x1 , ..., xk−1 ) is
x0 · a^(k−1) + x1 · a^(k−2) + ... + xk−2 · a + xk−1
where a is a constant. Choosing a to be a prime number helps minimize collisions.
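For example, here is one possible hash code for the two-dimensional point described above, following the tuple formula with k = 2 and a = 31. The class and field names are illustrative; Double.doubleToLongBits performs the bitwise conversion, and each long is folded into an int.

class Point {
    private double x, y;

    // Tuple hash code with k = 2 and a = 31: x0 * a + x1.
    public int hashCode() {
        long xBits = Double.doubleToLongBits(x);
        long yBits = Double.doubleToLongBits(y);
        // fold each 64-bit pattern into a 32-bit integer
        int x0 = (int) (xBits ^ (xBits >>> 32));
        int x1 = (int) (yBits ^ (yBits >>> 32));
        return 31 * x0 + x1;
    }
}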
Our hash code formula works even if k is unbounded, such as in strings. Java’s String class uses the
following formula:
public int hashCode() {
    int code = 0;
    for (int i = 0; i < length(); i++) {
        code = 31 * code + charAt(i);
    }
    return code;
}
This is our formula, with a = 31 and xi = charAt(i).
2 Sorting
The topic of sorting really requires no introduction. We start with an unsorted sequence, and want a sorted
sequence as the result. We begin by discussing simple but slow sorts, and move on to quicker but more
complicated algorithms. We will neglect the possibility of duplicate elements for most of the sorts, and we
will use integers as the keys in all of them.
Figure 1: A fast integer lookup table, with a load factor of 1.
void selectionSort(int[] x) {
    int tmp, minpos;
    for (int i = 0; i < x.length; i++) { // i is the start of the unsorted part
        // find the smallest remaining element
        minpos = i;
        for (int j = i+1; j < x.length; j++) {
            if (x[j] < x[minpos]) {
                minpos = j;
            }
        }
        // swap the smallest element with the first unsorted element
        tmp = x[i];
        x[i] = x[minpos];
        x[minpos] = tmp;
    }
}
Selection sort is a particularly slow algorithm. Finding the smallest remaining unsorted element takes n/2
time on average, and must be done n times, for an O(n^2) running time. The running time is always the
same regardless of the distribution of values in the input.
Figure 2: Selection sort applied to a sequence of integers.
List insertionSort(List x) {
    List y = a new empty list;
    for each element temp of x do:
        Find the first element in y that is greater than temp, call it z;
        if z exists then:
            Insert temp before z;
        else:
            Insert temp at the end of y;
        fi;
    od;
    return y;
}
On average, finding the greater element in the result list takes n/2 time, and must be done n times, for
a total of O(n^2) running time. However, in some cases, such as a reverse sorted list, finding the element
takes constant time, so the running time reduces to linear. The algorithm can be modified so that sorted,
rather than reverse sorted, lists are sorted in linear time.
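As a sketch of such a modification, the following array-based version compares each element against the sorted region from right to left, so an input that is already sorted makes only one comparison per element and runs in linear time:

void insertionSort(int[] x) {
    for (int i = 1; i < x.length; i++) { // x[0..i-1] is already sorted
        int temp = x[i];
        int j = i - 1;
        // shift larger elements right to make room for temp
        while (j >= 0 && x[j] > temp) {
            x[j + 1] = x[j];
            j--;
        }
        x[j + 1] = temp;
    }
}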
Figure 3: Insertion sort applied to a sequence of integers.

int[] mergeSort(int[] x) {
    if x.length == 1 then:
        return x;
    else:
        Split the array into two as equal as possible pieces y and z;
        y = mergeSort(y);
        z = mergeSort(z);
        return merge(y, z);
    fi;
}

Figure 4: The divide step of a merge sort.
The recursive calls in merge sort are arranged as a complete binary tree, with about 2n recursive
calls. Thus the divide portion of the algorithm runs in linear time. Before each recursive call returns,
it merges the results of its two child calls, which takes time linear in the number of elements to be merged.
But the total number of elements to be merged in each level of the tree is n, so each level takes n time to
merge. The tree has about lg n levels, so the merge step takes n lg n total time. Thus merge sort runs in
O(n lg n).
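For concreteness, here is a sketch of the merge step the pseudocode relies on; it assumes both inputs are sorted and runs in time linear in the total number of elements, as the analysis above requires.

int[] merge(int[] y, int[] z) {
    int[] result = new int[y.length + z.length];
    int i = 0, j = 0;
    for (int k = 0; k < result.length; k++) {
        // take from y when z is exhausted or y's element is smaller
        if (j >= z.length || (i < y.length && y[i] <= z[j])) {
            result[k] = y[i++];
        } else {
            result[k] = z[j++];
        }
    }
    return result;
}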
Figure 5: The merge step of a merge sort.
Figure 6: An in-place quick sort. Set boundaries are denoted by thick lines.
Figure 8: An LSD radix sort.
It takes linear time to compute the range of the elements and linear time to place the elements in their
corresponding buckets. But iterating through the table takes time linear in the number of buckets, or linear
in the range of the elements. So the running time of bucket sort is in O(max(n, r)) where n is the number
of elements and r is their range. If the range is on the order of n, then bucket sort is linear. But if the range
is large, then the sort may be slower than a quadratic algorithm.
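Here is a sketch of bucket sort for an int array, using one counter per possible value in the range; the function name and the counting-based buckets are illustrative choices.

int[] bucketSort(int[] x) {
    if (x.length == 0) return x;
    // compute the range of the elements: linear time
    int min = x[0], max = x[0];
    for (int v : x) {
        if (v < min) min = v;
        if (v > max) max = v;
    }
    // place elements in buckets: one count per possible value
    int[] counts = new int[max - min + 1];
    for (int v : x) {
        counts[v - min]++;
    }
    // iterate through the buckets: linear in the range r
    int[] result = new int[x.length];
    int pos = 0;
    for (int b = 0; b < counts.length; b++) {
        for (int c = counts[b]; c > 0; c--) {
            result[pos++] = b + min;
        }
    }
    return result;
}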
Radix sort instead sorts on one digit at a time, using a pass of bucket sort per digit. The number of digits
in an integer is at most 32 (since we would use the base-two representation in practice), and we can treat
this as a constant factor. So radix sort is always linear.
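As a sketch, here is an LSD radix sort for non-negative ints that uses base-256 digits rather than base two, so it makes four stable counting passes, one per byte; the non-negativity assumption and the digit size are simplifications.

int[] radixSort(int[] x) {
    int[] a = x.clone();
    for (int shift = 0; shift < 32; shift += 8) { // one pass per byte
        // count how many elements have each digit
        int[] counts = new int[256];
        for (int v : a) {
            counts[(v >> shift) & 0xFF]++;
        }
        // convert counts to starting positions for each digit
        int[] starts = new int[256];
        for (int d = 1; d < 256; d++) {
            starts[d] = starts[d - 1] + counts[d - 1];
        }
        // place elements stably by the current digit
        int[] next = new int[a.length];
        for (int v : a) {
            next[starts[(v >> shift) & 0xFF]++] = v;
        }
        a = next;
    }
    return a;
}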