String Sorts (Java)

Download as pdf or txt
Download as pdf or txt
You are on page 1of 71
At a glance
Powered by AI
Some of the main takeaways are that there are different string sorting algorithms like radix sort, quicksort, suffix arrays. String keys can be sorted in linear time by using the characters as indexes in an array. Specialized string sorting algorithms may be needed for non-random string data.

Some of the string sorting algorithms discussed are key-indexed counting, LSD radix sort, MSD radix sort, 3-way radix quicksort, and suffix arrays.

The char data type in C is typically an 8-bit integer that can represent only 256 characters and supports 7-bit ASCII. The char data type in Java is a 16-bit unsigned integer that supports original 16-bit Unicode and supports 21-bit Unicode 3.0.

Algorithms

R OBERT S EDGEWICK | K EVIN W AYNE

5.1 S TRING S ORTS


! strings in Java ! key-indexed counting

Algorithms
F O U R T H E D I T I O N

! LSD radix sort ! MSD radix sort ! 3-way radix quicksort ! suffix arrays

R OBERT S EDGEWICK | K EVIN W AYNE


http://algs4.cs.princeton.edu

5.1 S TRING S ORTS


! strings in Java ! key-indexed counting

Algorithms
R OBERT S EDGEWICK | K EVIN W AYNE
http://algs4.cs.princeton.edu

! LSD radix sort ! MSD radix sort ! 3-way radix quicksort ! suffix arrays

String processing
String. Sequence of characters.

Important fundamental abstraction.

Information processing. Genomic sequences. Communication systems (e.g., email). Programming systems (e.g., Java programs).
The digital information that underlies biochemistry, cell biology, and development can be represented by a simple string of G's, A's, T's and C's. This string is the root data structure of an organism's biology. M. V. Olson
0

The char data type


C char data type. Typically an 8-bit integer.

Supports 7-bit ASCII.Data Compression 667 Can represent only 256 characters.
6.5

ou HexDump a bit0 1 2 3 4 5 6 7 8 9 A B C D E F -encoded charac- 0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI eful for reference. 1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US r, use the rst hex 2 SP ! # $ % & ( ) * + , - . / second hex digit 3 0 1 2 3 4 5 6 7 8 9 : ; < = > ? nd the character 4 @ A B C D E F G H I J K L M N O e, 31 encodes the 5 P Q R S T U V W X Y Z [ \ ] ^ _ r J, and so forth. 6 ` a b c d e f g h i j k l m n o I, so the rst hex 7 p q r s t u v w x y z { | } ~ DEL numbers starting Hexadecimal to ASCII conversion table mbers 20 and 7F) g control characharacters are left over from the days when physical devices rolled by ASCII input; the table highlights a few that you Java char data type. A null 16-bit unsigned integer. ample SP is the space character, NUL is the character, LF ge-return. Supports original 16-bit Unicode.

U+0041

U+00E1

U+2202

U+1D50A

Unicode characters

Unicode 3.0 (awkwardly). data compression requires us 21-bit to reorient our thinking about Supports

d output to include binary encoding of data. BinaryStdIn e the methods that we need. They provide a way for you to

I (heart) Unicode

The String data type


String data type in Java. Sequence of characters (immutable). Length. Number of characters. Indexing. Get the ith character. Substring extraction. Get a contiguous subsequence of characters. String concatenation. Append one character to end of another string.

s.length()

0
s

1 T

2 T

3 A

4 C

5 K

6 A

7 T

8 D

9 10 11 12 A W N

s.charAt(3) s.substring(7, 11)

Fundamental constant-time String operations

The String data type: Java implementation


public final class String { private char[] value; private int offset; private int length; private int hash; implements Comparable<String> // // // // characters index of first char in array length of string cache of hashCode()
length

public int length() { return length; }

value[]

X
0

X
1

A
2

T
3

T
4

A
5

C
6

K
7

X
8

public char charAt(int i) { return value[i + offset];

offset

private String(int offset, int length, char[] value) { this.offset = offset; this.length = length; this.value = value; copy of reference to } original char array public String substring(int from, int to) { return new String(offset + from, to - from, value);

The String data type: performance


String data type (in Java). Sequence of characters (immutable). Underlying implementation. Immutable char[] array, offset, and length.

String operation length() charAt() substring() concat() guarantee 1 1 1 N extra space 1 1 1 N

Memory. 40 + 2N bytes for a virgin String of length N.


can use byte[] or char[] instead of String to save space (but lose convenience of String data type)
8

The StringBuilder data type


StringBuilder data type. Sequence of characters (mutable). Underlying implementation. Resizing char[] array and length.

String operation length() charAt() substring() concat() guarantee 1 1 1 N extra space 1 1 1 N

StringBuilder guarantee 1 1 N 1* extra space 1 1 N 1*


* amortized

Remark. StringBuffer data type is similar, but thread safe (and slower).

String vs. StringBuilder


Q. How to efficiently reverse a string?

A.

public static String reverse(String s) { String rev = ""; for (int i = s.length() - 1; i >= 0; i--) rev += s.charAt(i); return rev; }

quadratic time

B.

public static String reverse(String s) { StringBuilder rev = new StringBuilder(); for (int i = s.length() - 1; i >= 0; i--) rev.append(s.charAt(i)); return rev.toString(); }

linear time

10

String challenge: array of sufxes


Q. How to efficiently form array of suffixes?
input string a
0

a
1

c
2

a
3

a
4

g
5

t
6

t
7

t
8

a
9

10 11 12 13 14

su"xes
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

a a c a a g t t t a c a a g c

a c a a g t t t a c a a g c

c a a g t t t a c a a g c

a a g t t t a c a a g c

a g t t t a c a a g c

g t t t a c a a g c

t t t a c a a g c

t t a c a a g c

t a c a a g c

a c a a g c

c a a g c

a a g c a g c g c c

11

String vs. StringBuilder


Q. How to efficiently form array of suffixes?

A.

public static String[] suffixes(String s) { int N = s.length(); String[] suffixes = new String[N]; for (int i = 0; i < N; i++) suffixes[i] = s.substring(i, N); return suffixes; }

linear time and linear space

B.

public static String[] suffixes(String s) { int N = s.length(); StringBuilder sb = new StringBuilder(s); String[] suffixes = new String[N]; for (int i = 0; i < N; i++) suffixes[i] = sb.substring(i, N); return suffixes; }

quadratic time and quadratic space

12

Longest common prex


Q. How long to compute length of longest common prefix?
p
0

r
1

e
2

f
3

e
4

t
5

c
6

h
7

public static int lcp(String s, String t) { int N = Math.min(s.length(), t.length()); for (int i = 0; i < N; i++) if (s.charAt(i) != t.charAt(i)) return i; return N; }

linear time (worst case) sublinear time (typical case)

Running time. Proportional to length D of longest common prefix. Remark. Also can compute compareTo() in sublinear time.
13

Alphabets
Digital key. Sequence of digits over fixed alphabet. Radix. Number of digits R in alphabet.
604
CHAPTER 6
!

Strings

name

R()

lgR()

characters

BINARY OCTAL DECIMAL HEXADECIMAL DNA LOWERCASE UPPERCASE PROTEIN BASE64 ASCII EXTENDED_ASCII UNICODE16

2 8 10 16 4 26 26 20 64 128 256 65536

1 3 4 4 2 5 5 5 6 7 8 16
Standard alphabets

01 01234567 0123456789 0123456789ABCDEF ACTG abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ ACDEFGHIKLMNPQRSTVWY ABCDEFGHIJKLMNOPQRSTUVWXYZabcdef ghijklmnopqrstuvwxyz0123456789+/

ASCII characters extended ASCII characters Unicode characters

holds the frequencies in Count is an example of a character-indexed array. With a Java

14

5.1 S TRING S ORTS


! strings in Java ! key-indexed counting

Algorithms
R OBERT S EDGEWICK | K EVIN W AYNE
http://algs4.cs.princeton.edu

! LSD radix sort ! MSD radix sort ! 3-way radix quicksort ! suffix arrays

Review: summary of the performance of sorting algorithms


Frequency of operations = key compares.

algorithm

guarantee

random

extra space

stable?

operations on keys

insertion sort

N2

N2

yes

compareTo()

mergesort

N lg N

N lg N

yes

compareTo()

quicksort

1.39 N lg N

1.39 N lg N

c lg N

no

compareTo()

heapsort

2 N lg N

2 N lg N

no

compareTo()
* probabilistic

Lower bound. ~ N lg N compares required by any compare-based algorithm. Q. Can we do better (despite the lower bound)? A. Yes, if we don't depend on key compares.
16

Key-indexed counting: assumptions about keys


Assumption. Keys are integers between 0 and R - 1. Implication. Can use key as an array index. Applications.
name section Anderson 2 Brown 3 Davis 3 Garcia 4 Harris 1 Jackson 3 Johnson 4 Jones 3 Martin 1 Martinez 2 Miller 2 Moore 1 Robinson 2 Smith 4 Taylor 3 Thomas 4 Thompson 4 White 2 Williams 3 Wilson 4 keys are small integers
Typical candidate for key-indexed counting
17

input

Sort string by first letter. Sort class roster by section. Sort phone numbers by area code. Subroutine in a sorting algorithm. [stay tuned]
Remark. Keys may have associated data ! can't just count up number of keys of each value.

(by section) Harris Martin Moore Anderson Martinez Miller Robinson White Brown Davis Jackson Jones Taylor Williams Garcia Johnson Smith Thomas Thompson Wilson

sorted result

1 1 1 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4

Key-indexed counting demo


Goal. Sort an array a[] of N integers between 0 and R - 1.

Count frequencies of each letter using key as index. Compute frequency cumulates which specify destinations. Access cumulates using key as index to move items. Copy back into original array.
i a[i]

R=6

int N = a.length; int[] count = new int[R+1]; for (int i = 0; i < N; i++) count[a[i]+1]++; for (int r = 0; r < R; r++) count[r+1] += count[r]; for (int i = 0; i < N; i++) aux[count[a[i]]++] = a[i]; for (int i = 0; i < N; i++) a[i] = aux[i];

0 1 2 3 4 5 6 7 8 9 10 11

d a c f f b d b f b e a
use a b c d e f for for for for for for 0 1 2 3 4 5

18

Key-indexed counting demo


Goal. Sort an array a[] of N integers between 0 and R - 1.

Count frequencies of each letter using key as index. Compute frequency cumulates which specify destinations. Access cumulates using key as index to move items. Copy back into original array.
i a[i]

offset by 1 [stay tuned]

int N = a.length; int[] count = new int[R+1];


count frequencies

0 1 2 3 4 5 6 7 8 9 10 11

d a c f f b d b f b e a
r count[r]

for (int i = 0; i < N; i++) count[a[i]+1]++; for (int r = 0; r < R; r++) count[r+1] += count[r]; for (int i = 0; i < N; i++) aux[count[a[i]]++] = a[i]; for (int i = 0; i < N; i++) a[i] = aux[i];

a b c d e f -

0 2 3 1 2 1 3

19

Key-indexed counting demo


Goal. Sort an array a[] of N integers between 0 and R - 1.

Count frequencies of each letter using key as index. Compute frequency cumulates which specify destinations. Access cumulates using key as index to move items. Copy back into original array.
i a[i]

int N = a.length; int[] count = new int[R+1]; for (int i = 0; i < N; i++) count[a[i]+1]++;
compute cumulates

0 1 2 3 4 5 6 7 8 9 10 11

d a c f f b d b f b e a
r count[r]

a b c d e f -

0 2 5 6 8 9 12

for (int r = 0; r < R; r++) count[r+1] += count[r]; for (int i = 0; i < N; i++) aux[count[a[i]]++] = a[i]; for (int i = 0; i < N; i++) a[i] = aux[i];

6 keys < d, 8 keys < e so ds go in a[6] and a[7]


20

Key-indexed counting demo


Goal. Sort an array a[] of N integers between 0 and R - 1.

Count frequencies of each letter using key as index. Compute frequency cumulates which specify destinations. Access cumulates using key as index to move items. Copy back into original array.
i a[i]

aux[i]

int N = a.length; int[] count = new int[R+1]; for (int i = 0; i < N; i++) count[a[i]+1]++; for (int r = 0; r < R; r++) count[r+1] += count[r];
move items

0 1 2 3 4 5 6 7 8 9 10 11

d a c f f b d b f b e a
r count[r]

0 1 2 3 4 5 6 7 8 9 10 11

a a b b b c d d e f f f

a b c d e f -

2 5 6 8 9 12 12

for (int i = 0; i < N; i++) aux[count[a[i]]++] = a[i]; for (int i = 0; i < N; i++) a[i] = aux[i];

21

Key-indexed counting demo


Goal. Sort an array a[] of N integers between 0 and R - 1.

Count frequencies of each letter using key as index. Compute frequency cumulates which specify destinations. Access cumulates using key as index to move items. Copy back into original array.
i a[i]

aux[i]

int N = a.length; int[] count = new int[R+1]; for (int i = 0; i < N; i++) count[a[i]+1]++; for (int r = 0; r < R; r++) count[r+1] += count[r]; for (int i = 0; i < N; i++) aux[count[a[i]]++] = a[i];
copy back

0 1 2 3 4 5 6 7 8 9 10 11

a a b b b c d d e f f f
r count[r]

0 1 2 3 4 5 6 7 8 9 10 11

a a b b b c d d e f f f

a b c d e f -

2 5 6 8 9 12 12

for (int i = 0; i < N; i++) a[i] = aux[i];

22

Key-indexed counting: analysis


Proposition. Key-indexed counting uses ~ 11 N + 4 R array accesses to sort N items whose keys are integers between 0 and R - 1.
for (int i = 0; i < N; i++) aux[count[a[i].key(d)]++] = a[i];

Proposition. Key-indexed counting uses extra space proportional to N + R. count[] Stable?


!
1 0 0 0 0 0 1 1 1 1 2 2 2 3 3 3 3 3 3 3 3 3 2 3 4 4 4 4 4 4 4 4 4 5 6 6 7 7 7 7 7 8 8 8 3 8 8 9 10 10 10 11 11 12 12 12 12 12 12 12 13 13 13 13 14 14 4 14 14 14 14 15 15 15 16 16 16 16 16 16 16 17 17 18 19 19 19 20
a[0] a[1] a[2] a[3] a[4] a[5] a[6] a[7] a[8] a[9] a[10] a[11] a[12] a[13] a[14] a[15] a[16] a[17] a[18] a[19]

Anderson Brown Davis Garcia Harris Jackson Johnson Jones Martin Martinez Miller Moore Robinson Smith Taylor Thomas Thompson White Williams Wilson

2 3 3 4 1 3 4 3 1 2 2 1 2 4 3 4 4 2 3 4

Harris Martin Moore Anderson Martinez Miller Robinson White Brown Davis Jackson Jones Taylor Williams Garcia Johnson Smith Thomas Thompson Wilson

1 1 1 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4

aux[0] aux[1] aux[2] aux[3] aux[4] aux[5] aux[6] aux[7] aux[8] aux[9] aux[10] aux[11] aux[12] aux[13] aux[14] aux[15] aux[16] aux[17] aux[18] aux[19]

Distributing the data (records with key 3 highlighted)

23

5.1 S TRING S ORTS


! strings in Java ! key-indexed counting

Algorithms
R OBERT S EDGEWICK | K EVIN W AYNE
http://algs4.cs.princeton.edu

! LSD radix sort ! MSD radix sort ! 3-way radix quicksort ! suffix arrays

Least-signicant-digit-rst string sort


LSD string (radix) sort.

Consider characters from right to left. Stably sort using d character as the key (using key-indexed counting).
th sort key (d = 2) sort key (d = 1) sort key (d = 0) 0 1 2 3 4 5 6 7 8 9 10 11 d a c f f b d b f b e a a d a a e a a e e e b c b d b d e d d e d d b e 0 1 2 3 4 5 6 7 8 9 10 11 d c e a f b d f b f b a a a b d a a a e e e e c b b b d d d d d d e e e 0 1 2 3 4 5 6 7 8 9 10 11 d c f b d e a a f b f b a a a a a b c d e e e e b b d d d b e d d d e e 0 1 2 3 4 5 6 7 8 9 10 11 a a b b b c d d e f f f c d a e e a a a b a e e e d d d e b b d b d d e

sort must be stable (arrows do not cross)


25

LSD string sort: correctness proof


Proposition. LSD sorts fixed-length strings in ascending order.
sort key

Pf. [by induction on i] After pass i, strings are sorted by last i characters.
0 1 2 3 4 5 6 7 8 9 10 11 d c f b d e a a f b f b a a a a a b c d e e e e b b d d d b e d d d e e 0 1 2 3 4 5 6 7 8 9 10 11 a a b b b c d d e f f f c d a e e a a a b a e e e d d d e b b d b d d e

If two strings differ on sort key,


key-indexed sort puts them in proper relative order.

If two strings agree on sort key,


stability keeps them in proper relative order.

Proposition. LSD sort is stable.

sorted from previous passes (by induction)


26

LSD string sort: Java implementation


public class LSD { public static void sort(String[] a, int W) { int R = 256; int N = a.length; String[] aux = new String[N]; for (int d = W-1; d >= 0; d--) { int[] count = new int[R+1]; for (int i = 0; i < N; i++) count[a[i].charAt(d) + 1]++; for (int r = 0; r < R; r++) count[r+1] += count[r]; for (int i = 0; i < N; i++) aux[count[a[i].charAt(d)]++] = a[i]; for (int i = 0; i < N; i++) a[i] = aux[i]; } } }
27

fixed-length W strings radix R

do key-indexed counting for each digit from right to left

key-indexed counting

Summary of the performance of sorting algorithms


Frequency of operations.

algorithm

guarantee

random

extra space

stable?

operations on keys

insertion sort

N2

N2

yes

compareTo()

mergesort

N lg N

N lg N

yes

compareTo()

quicksort

1.39 N lg N

1.39 N lg N

c lg N

no

compareTo()

heapsort

2 N lg N

2 N lg N

no

compareTo()

LSD

2WN

2WN

N+R

yes

charAt()

* probabilistic fixed-length W keys

Q. What if strings do not have same length?


28

String sorting interview question


Problem. Sort one million 32-bit integers. Ex. Google (or presidential) interview. Which sorting method to use?

Insertion sort. Mergesort. Quicksort. Heapsort. LSD string sort.

29

String sorting interview question

Google CEO Eric Schmidt interviews Barack Obama

30

How to take a census in 1900s?


1880 Census. Took 1,500 people 7 years to manually process data. Herman Hollerith. Developed counting and sorting machine to automate.

Use punch cards to record data (e.g., gender, age). Machine sorts one column at a time (into one of 12 bins). Typical question: how many women of age 20 to 30?

Hollerith tabulating machine and sorter

punch card (12 holes per column)

1890 Census. Finished months early and under budget!


31

How to get rich sorting in 1900s?


Punch cards. [1900s to 1950s]

Also useful for accounting, inventory, and business processes. Primary medium for data entry, storage, and processing.
Hollerith's company later merged with 3 others to form Computing Tabulating Recording Corporation (CTRC); company renamed in 1924.

IBM 80 Series Card Sorter (650 cards per minute)


32

LSD string sort: a moment in history (1960s)

card punch

punched cards

card reader

mainframe

line printer

To sort a card deck - start on right column - put cards into hopper - machine distributes into bins - pick up cards (stable) - move left one column - continue until sorted card sorter

not related to sorting

Lysergic Acid Diethylamide (Lucy in the Sky with Diamonds)


33

5.1 S TRING S ORTS


! strings in Java ! key-indexed counting

Algorithms
R OBERT S EDGEWICK | K EVIN W AYNE
http://algs4.cs.princeton.edu

! LSD radix sort ! MSD radix sort ! 3-way radix quicksort ! suffix arrays

Most-signicant-digit-rst string sort


MSD string (radix) sort.

Partition array into R pieces according to first character


(use key-indexed counting).

Recursively sort all strings that start with each character


(key-indexed counts delineate subarrays to sort).
0 a a d c d e

0 1 2 3 4 5 6 7 8 9 10 11

d a c f f b d b f b e a

a d a a e a a e e e b c

b d b d e d d e d d b e

0 1 2 3 4 5 6 7 8 9 10 11

a a b b b c d d e f f f

d c a e e a a a b a e e

d e d e d b b d b d e d

count[]
a b c d e f -

0 2 5 6 8 9 12

2 3 4

b b b

a e e

d e d

sort subarrays recursively

6 7 8

d d e

a a b

b d b

9 10

f f f

a e e

d e d
35

sort key

11

MSD string sort: example


input d

she sells seashells by the sea shore the shells she sells are surely seashells

are by lo she sells seashells sea shore shells she sells surely seashells the hi the

are by sells seashells sea sells seashells she shore shells she surely the the

are by seashells sea seashells sells sells she shore shells she surely the the

are by sea seashells seashells sells sells she shore shells she surely the the

are by sea seashells seashells sells sells she shore shells she surely the the

are by sea seashells seashells sells sells she shore shells she surely the the

are by seas seashells seashells sells sells she shells shore she surely the the

are by sea seashells seashells sells sells she shells shore she surely the the

need to examine every character in equal keys are are are are by by by by sea sea sea sea seashells seashells seashells seashells seashells seashells seashells seashells sells sells sells sells sells sells sells sells she she she she shells shells shells shells she she she she shore shore shore shore surely surely surely surely the the the the the the the the

end-of-string goes before any char value are are are by by by sea sea sea seashells seashells seashells seashells seashells seashells sells sells sells sells sells sells she she she she she she shells shells shells shore shore shore surely surely surely the the the the the the

output

are by sea seashells seashells sells sells she she shells shore surely the the
36

Trace of recursive calls for MSD string sort (no cutoff for small subarrays, subarrays of size 0 and 1 omitted)

Variable-length strings
Treat strings as if they had an extra char at end (smaller than any char).
why smaller? 0 1 2 3 4 5 6 7 s s s s s s s s e e e h h h h u a a l e e e o r -1 s l -1 -1 l r e l e l s -1 y -1 -1 she before shells h s e -1 l l s -1

private static int charAt(String s, int d) { if (d < s.length()) return s.charAt(d); else return -1; }

C strings. Have extra char '\0' at end ! no extra work needed.


37

MSD string sort: Java implementation


public static void sort(String[] a) { aux = new String[a.length]; sort(a, aux, 0, a.length, 0); }

can recycle aux[] array but not count[] array

private static void sort(String[] a, String[] aux, int lo, int hi, int d) { if (hi <= lo) return; key-indexed counting int[] count = new int[R+2]; for (int i = lo; i <= hi; i++) count[charAt(a[i], d) + 2]++; for (int r = 0; r < R+1; r++) count[r+1] += count[r]; for (int i = lo; i <= hi; i++) aux[count[charAt(a[i], d) + 1]++] = a[i]; for (int i = lo; i <= hi; i++) a[i] = aux[i - lo];
sort R subarrays recursively for (int r = 0; r < R; r++) sort(a, aux, lo + count[r], lo + count[r+1] - 1, d+1);

}
38

MSD string sort: potential for disastrous performance


Observation 1. Much too slow for small subarrays.
count[]

Each function call needs its own count[] array. ASCII (256 counts): 100x slower than copy pass for N = 2. Unicode (65,536 counts): 32,000x slower for N = 2.
Observation 2. Huge number of small subarrays because of recursion.

a[]
0 1 b a 0 1

aux[]
a b

39

Cutoff to insertion sort


Solution. Cutoff to insertion sort for small subarrays.

Insertion sort, but start at d character. Implement less() so that it compares starting at d
th

th

character.

public static void sort(String[] a, int lo, int hi, int d) { for (int i = lo; i <= hi; i++) for (int j = i; j > lo && less(a[j], a[j-1], d); j--) exch(a, j, j-1); } private static boolean less(String v, String w, int d) { return v.substring(d).compareTo(w.substring(d)) < 0;

in Java, forming and comparing substrings is faster than directly comparing chars with charAt()
40

MSD string sort: performance


Number of characters examined.

MSD examines just enough characters to sort the keys. Number of characters examined depends on keys. Can be sublinear in input size!
compareTo() based sorts can also be sublinear!
Random (sublinear) Non-random with duplicates (nearly linear) Worst case (linear)

1EIO402 1HYL490 1ROZ572 2HXE734 2IYE230 2XOR846 3CDB573 3CVP720 3IGJ319 3KNA382 3TAV879 4CQP781 4QGI284 4YHV229

are by sea seashells seashells sells sells she she shells shore surely the the

1DNB377 1DNB377 1DNB377 1DNB377 1DNB377 1DNB377 1DNB377 1DNB377 1DNB377 1DNB377 1DNB377 1DNB377 1DNB377 1DNB377

Characters examined by MSD string sort


41

Summary of the performance of sorting algorithms


Frequency of operations.
algorithm guarantee random extra space stable? operations on keys

insertion sort

N2

N2

yes

compareTo()

mergesort

N lg N

N lg N

yes

compareTo()

quicksort

1.39 N lg N

1.39 N lg N

c lg N

no

compareTo()

heapsort

2 N lg N

2 N lg N

no

compareTo()

LSD

2NW

2NW

N+R

yes

charAt()

MSD

2NW

N log R N

N+DR

yes

charAt()

D = function-call stack depth (length of longest prefix match)

* probabilistic fixed-length W keys average-length W keys


42

MSD string sort vs. quicksort for strings


Disadvantages of MSD string sort.

Extra space for aux[]. Extra space for count[]. Inner loop has a lot of instructions. Accesses memory "randomly" (cache inefficient).
Disadvantage of quicksort.

Linearithmic number of string compares (not linear). Has to rescan many characters in keys with long prefix matches.

Goal. Combine advantages of MSD and quicksort.


43

5.1 S TRING S ORTS


! strings in Java ! key-indexed counting

Algorithms
R OBERT S EDGEWICK | K EVIN W AYNE
http://algs4.cs.princeton.edu

! LSD radix sort ! MSD radix sort ! 3-way radix quicksort ! suffix arrays

3-way string quicksort (Bentley and Sedgewick, 1997)


Overview. Do 3-way partitioning on the dth character.

Less overhead than R-way partitioning in MSD string sort. Does not re-examine characters equal to the partitioning char
(but does re-examine characters not equal to the partitioning char).
partitioning item use first character to partition into "less", "equal", and "greater" subarrays

she sells seashells by the sea shore the shells she sells are surely seashells

by are seashells she seashells sea shore surely shells she sells sells the the
45

recursively sort subarrays, excluding first character for middle subarray

3-way string quicksort: trace of recursive calls


partitioning item

she sells seashells by the sea shore the shells she sells are surely seashells

by are seashells she seashells sea shore surely shells she sells sells the the

are by seashells she seashells sea shore surely shells she sells sells the the

are by seashells sells seashells sea sells shells she surely shore she the the

are by seashells sea seashells sells sells shells she surely shore she the the

Trace of rst few recursive calls for 3-way string quicksort (subarrays of size 1 not shown)

46

3-way string quicksort: Java implementation

private static void sort(String[] a) { sort(a, 0, a.length - 1, 0); } private static void sort(String[] a, int lo, int hi, int d) { if (hi <= lo) return; 3-way partitioning (using dth character) int lt = lo, gt = hi; int v = charAt(a[lo], d); int i = lo + 1; while (i <= gt) to handle variable-length strings { int t = charAt(a[i], d); if (t < v) exch(a, lt++, i++); else if (t > v) exch(a, i, gt--); else i++; } sort(a, lo, lt-1, d); if (v >= 0) sort(a, lt, gt, d+1); sort(a, gt+1, hi, d); }
47

sort 3 subarrays recursively

3-way string quicksort vs. standard quicksort


Standard quicksort.

Uses ~ 2 N ln N string compares on average. Costly for keys with long common prefixes (and this is a common case!)
3-way string (radix) quicksort.

Uses ~ 2 N ln N character compares on average for random strings. Avoids re-comparing long common prefixes.

Fast Algorithms for Sorting and Searching Strings


Jon L. Bentley* Robert Sedgewick#

Abstract
We present theoretical algorithms for sorting and searching multikey data, and derive from them practical C implementations for applications in which keys are character strings. The sorting algorithm blends Quicksort and radix sort; it is competitive with the best known C sort codes. The searching algorithm blends tries and binary search trees; it is faster than hashing and other commonly used search methods. The basic ideas behind the algorithms date back at least to the 1960s but their practical utility has been overlooked. We also present extensions to more complex string problems, such as partial-match searching.

that is competitive with the most efficient string sorting programs known. The second program is a symbol table implementation that is faster than hashing, which is commonly regarded as the fastest symbol table implementation. The symbol table implementation is much more space-efficient than multiway trees, and supports more advanced searches. In many application programs, sorts use a Quicksort implementation based on an abstract compare operation, and searches use hashing or binary search trees. These do not take advantage of the properties of string keys, which are widely used in practice. Our algorithms provide a natural and elegant way to adapt classical algorithms to this important class of applications. Section 6 turns to more difficult string-searching problems. Partial-match queries allow dont care characters

48

1. Introduction
Section 2 briefly reviews Hoares [9] Quicksort and

3-way string quicksort vs. MSD string sort


MSD string sort.

Is cache-inefficient. Too much memory storing count[]. Too much overhead reinitializing count[] and aux[].
library of Congress call numbers

3-way string quicksort.

Has a short inner loop. Is cache-friendly. Is in-place.

Bottom line. 3-way string quicksort is method of choice for sorting strings.
49

Summary of the performance of sorting algorithms


Frequency of operations.
algorithm guarantee random extra space stable? operations on keys

insertion sort

N2

N2

yes

compareTo()

mergesort

N lg N

N lg N

yes

compareTo()

quicksort

1.39 N lg N

1.39 N lg N

c lg N

no

compareTo()

heapsort

2 N lg N

2 N lg N

no

compareTo()

LSD

2NW

2NW

N+R

yes

charAt()

MSD

2NW

N log R N

N+DR

yes

charAt()

3-way string quicksort

1.39 W N lg N

1.39 N lg N

log N + W

no

charAt()

* probabilistic fixed-length W keys average-length W keys

50

5.1 S TRING S ORTS


! strings in Java ! key-indexed counting

Algorithms
R OBERT S EDGEWICK | K EVIN W AYNE
http://algs4.cs.princeton.edu

! LSD radix sort ! MSD radix sort ! 3-way radix quicksort ! suffix arrays

Keyword-in-context search
Given a text of N characters, preprocess it to enable fast substring search (find all occurrences of query string context).
% more it was it was it was it was it was it was it was it was it was it was ! tale.txt the best of times the worst of times the age of wisdom the age of foolishness the epoch of belief the epoch of incredulity the season of light the season of darkness the spring of hope the winter of despair

Applications. Linguistics, databases, web search, word processing, .


52

Keyword-in-context search
Given a text of N characters, preprocess it to enable fast substring search (find all occurrences of query string context).
characters of % java KWIC tale.txt 15 surrounding context search o st giless to search for contraband her unavailing search for your fathe le and gone in search of her husband t provinces in search of impoverishe dispersing in search of other carri n that bed and search the straw hold

better thing t is a far far better thing that i do than some sense of better things else forgotte was capable of better things mr carton ent

Applications. Linguistics, databases, web search, word processing, .


53

Sufx sort
input string

i
0

t
1

w
2

a
3

s
4

b
5

e
6

s
7

t
8

i
9

10 11 12 13 14

form su"xes
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

sort su"xes to bring repeated substrings together

i t w a s b e s t i t w a s w

t w a s b e s t i t w a s w

w a s b e s t i t w a s w

a s b e s t i t w a s w

s b e s t i t w a s w

b e s t i t w a s w

e s t i t w a s w

s t i t w a s w

t i t w a s w

i t w a s w

t w a s w

w a s w a s w s w w

3 12 5 6 0 9 4 7 13 8 1 10 14 2 11

a a b e i i s s s t t t w w w

s s e s t t b t w i w w

b w s t w w e i

e s t t i a a s t i t s s t w t w b w i a w a s w a s w e s t i t w a s w t w a s w s w

t w a s w a s b e s t i t w a s w a s w

a s b e s t i t w a s w a s w

54

Keyword-in-context search: sufx-sorting solution

Preprocess: suffix sort the text. Query: binary search for query; scan until mismatch.
KWIC search for "search" in Tale of Two Cities

!
632698 713727 660598 67610 4430 42705 499797 182045 143399 411801 158410 691536 536569 484763

s e a l e d _ m y _ l e t t e r _ a n d _ s e a m s t r e s s _ i s _ l i f t e d _ s e a m s t r e s s _ o f _ t w e n t y _ s e a m s t r e s s _ w h o _ w a s _ w i s e a r c h _ f o r _ c o n t r a b a n d s e a r c h _ f o r _ y o u r _ f a t h e s e a r c h _ o f _ h e r _ h u s b a n d s e a r c h _ o f _ i m p o v e r i s h e s e a r c h _ o f _ o t h e r _ c a r r i s e a r c h _ t h e _ s t r a w _ h o l d s e a r e d _ m a r k i n g _ a b o u t _ s e a s _ a n d _ m a d a m e _ d e f a r s e a s e _ a _ t e r r i b l e _ p a s s s e a s e _ t h a t _ h a d _ b r o u g h !
55

Longest repeated substring


Given a string of N characters, find the longest repeated substring.

a g c c a a t c a g

a g c t c g a a a a

c a t g c a g c c c

a g a t g t a t a a

a a a c g a t c c g

g g t g a g c t a a

t t c t a a t c c a

t t c c g t a a t a

t a t g g a g c a a

a t t t c g c a c a

c a g c c a t c t a

a c t a g c a t a a

a t g t g c g c c a

g g t a a c c a g c

c g g t c c t a a t

a t t a a t a g c c

t c a t a a g a a t

g g c c g g c g g a

a t a g g a t t a t

t c c a c t c t c a

g a a g g a a a g t

c a c a g c t t a c

t a a t g a c a c t

g c c c g c g c c a

t c t a g a a t a t

a t a t g t t g a a

c g c c t a a g c a

t a t g a c c t c a

a a a a t a a c a a

Applications. Bioinformatics, cryptanalysis, data compression, ...

56

Longest repeated substring: a musical application


Visualize repetitions in music.
Mary Had a Little Lamb

http://www.bewitched.com

Bach's Goldberg Variations

57

Longest repeated substring


Given a string of N characters, find the longest repeated substring.

Brute-force algorithm.

Try all indices i and j for start of possible match. Compute longest common prefix (LCP) for each pair.

Analysis. Running time " D N 2 , where D is length of longest match.


58

Longest repeated substring: a sorting solution


input string

a
0

a
1

c
2

a
3

a
4

g
5

t
6

t
7

t
8

a
9

10 11 12 13 14

form su"xes
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

sort su"xes to bring repeated substrings together

a a c a a g t t t a c a a g c

a c a a g t t t a c a a g c

c a a g t t t a c a a g c

a a g t t t a c a a g c

a g t t t a c a a g c

g t t t a c a a g c

t t t a c a a g c

t t a c a a g c

t a c a a g c

a c a a g c

c a a g c

a a g c a g c g c c

0 11 3 9 1 12 4 14 10 2 13 5 8 7 6

a a a a a a a c c c g g t t t

a a a c c g g a a c t a t t

c g g a a c t

a c t a a

a g t t t a c a a g c t t a c a a g c g c g t t t a c a a g c

t t a c a a g c

a g c a g t t t a c a a g c t c a t t a c a a a a c c g a a a a g c c g c a g c

compute longest prex between adjacent su"xes

a
0

a
1

c
2

a
3

a
4

g
5

t
6

t
7

t
8

a
9

c
59

10 11 12 13 14

Longest repeated substring: Java implementation


public String lrs(String s) { int N = s.length(); String[] suffixes = new String[N]; for (int i = 0; i < N; i++) suffixes[i] = s.substring(i, N); Arrays.sort(suffixes); String lrs = ""; for (int i = 0; i < N-1; i++) { int len = lcp(suffixes[i], suffixes[i+1]); if (len > lrs.length()) lrs = suffixes[i].substring(0, len); } return lrs; }

create suffixes (linear time and space) sort suffixes find LCP between adjacent suffixes in sorted order

% java LRS < mobydick.txt ,- Such a funny, sporty, gamy, jesty, joky, hoky-poky lad, is the Ocean, oh! Th

60

Sorting challenge
Problem. Five scientists A, B, C, D, and E are looking for long repeated substring in a genome with over 1 billion nucleotides.

A has a grad student do it by hand. B uses brute force (check all pairs). C uses suffix sorting solution with insertion sort. D uses suffix sorting solution with LSD string sort. ! E uses suffix sorting solution with 3-way string quicksort.
but only if LRS is not long (!)

Q. Which one is more likely to lead to a cure cancer?

61

Longest repeated substring: empirical analysis

input file LRS.java amendments.txt aesop.txt mobydick.txt chromosome11.txt pi.txt pipi.txt

characters 2,162 18,369 191,945 1.2 million 7.1 million 10 million 20 million

brute 0.6 sec 37 sec 1.2 hours 43 hours 2 months 4 months forever

suffix sort 0.14 sec 0.25 sec 1.0 sec 7.6 sec 61 sec 84 sec ???

length of LRS 73 216 58 79 12,567 14 10 million

estimated

62

Sufx sorting: worst-case input


Bad input: longest repeated substring very long.

Ex: Ex:
0 1 2 3 4 5 6 7 8 9

same letter repeated N times. two copies of the same Java codebase.
form su"xes sorted su"xes

t w i n s t w i n s

w i n s t w i n s

i n s t w i n s

n s t w i n s

s t w i n s

t w i n s

w i n s i n s n s s

9 8 7 6 5 4 3 2 1 0

i i n n s s t t w w

n s n s t w i n s s s t w i n s t w w i i w i i n n i n n s s n s s s t w i n s t w i n s

LRS needs at least 1 + 2 + 3 + ... + D character compares, where D = length of longest match. Running time. Quadratic (or worse) in D for LRS (and also for sort).
63

Sufx sorting challenge


Problem. Suffix sort an arbitrary string of length N.

Q. What is worst-case running time of best algorithm for problem?

Quadratic. ! Linearithmic. ! Linear. Nobody knows.

Manber-Myers algorithm suffix trees (beyond our scope)

64

Sufx sorting in linearithmic time


Manber-Myers MSD algorithm overview.

Phase 0: sort on first character using key-indexed counting sort. Phase i: given array of suffixes sorted on first 2 characters,
i-1

create array of suffixes sorted on first 2i characters. Worst-case running time. N lg N.

Finishes after lg N phases. Can perform a phase in linear time. (!)

[ahead]

65

Linearithmic sufx sort example: phase 0


original su"xes
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

key-indexed counting sort (rst character)

b a b a a a a b c b a b a a a a a 0

a b a a a a b c b a b a a a a a 0

b a a a a b c b a b a a a a a 0

a a a a b c b a b a a a a a 0

a a a b c b a b a a a a a 0

a a b c b a b a a a a a 0

a b c b a b a a a a a 0

b c b a b a a a a a 0

c b a b a a a a a 0

b a b a a a a a 0

a b a a a a a 0

b a a a a a 0

a a a a a 0

a a a a 0

a a a 0 a a 0 a 0 0

17 1 16 3 4 5 6 15 14 13 12 10 0 9 11 7 2 8

0 a a a a a a a a a a a b b b b b c

b 0 a a a b a a a a b a a a c a b

a a a a b c b a b a a a a a 0 a a b c 0 a a a a b b a b a a a b c b 0 a a a a a a a a b b c b a c b a b b a b a a b a a b a a a a a a a a a a a a a a 0 a a 0 a 0 0

0 a a a a a b a a

0 a a a a a b a

a a a 0 a c a

0 b c b a b a a a a a 0 a 0 a a a 0 b a b a a a a a 0 a a 0

sorted

66

Linearithmic sufx sort example: phase 1


original su"xes
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

index sort (rst two characters)

b a b a a a a b c b a b a a a a a 0

a b a a a a b c b a b a a a a a 0

b a a a a b c b a b a a a a a 0

a a a a b c b a b a a a a a 0

a a a b c b a b a a a a a 0

a a b c b a b a a a a a 0

a b c b a b a a a a a 0

b c b a b a a a a a 0

c b a b a a a a a 0

b a b a a a a a 0

a b a a a a a 0

b a a a a a 0

a a a a a 0

a a a a 0

a a a 0 a a 0 a 0 0

17 16 12 3 4 5 13 15 14 6 1 10 0 9 11 2 7 8

0 a a a a a a a a a a a b b b b b c

0 a a a a a a a b b b a a a a c b

a a a b a 0 a c a a b b a a b a

a a b c a 0 b a a a a a a a b

a b c b 0

0 c b a b a a a a a 0 b a b a a a a a 0 a b a a a a a 0

a a a a a a a b a

b a a a a a b a a

a b a a a 0 c a a

a c 0 b a

a a a 0 b a b a a a a a 0 c b a b a a a a a 0 0

b a b a a a a a 0 a a a 0 a a 0

sorted

67

Linearithmic sufx sort example: phase 2


original su"xes
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

index sort (rst four characters)

b a b a a a a b c b a b a a a a a 0

a b a a a a b c b a b a a a a a 0

b a a a a b c b a b a a a a a 0

a a a a b c b a b a a a a a 0

a a a b c b a b a a a a a 0

a a b c b a b a a a a a 0

a b c b a b a a a a a 0

b c b a b a a a a a 0

c b a b a a a a a 0

b a b a a a a a 0

a b a a a a a 0

b a a a a a 0

a a a a a 0

a a a a 0

a a a 0 a a 0 a 0 0

17 16 15 14 3 12 13 4 5 1 10 6 2 11 0 9 7 8

0 a a a a a a a a a a a b b b b b c

0 a a a a a a a b b b a a a a c b

0 a a a a a b a a c a a b b b a

0 a a a b c a a b a a a a a b

b a 0 c b a a a a a a a b a

c b a b a a a a a 0 0 b a a a b b a a a a a a b b a a c 0 a a a a b a c 0 a b b a a a a a a a a 0 a a a a 0 b a b a a a a a 0 a a a 0 a b a a a a a 0 a 0 c b a b a a a a a 0 0 a a 0 a 0

sorted

68

Linearithmic sufx sort example: phase 3


original su"xes
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

index sort (rst eight characters)

b a b a a a a b c b a b a a a a a 0

a b a a a a b c b a b a a a a a 0

b a a a a b c b a b a a a a a 0

a a a a b c b a b a a a a a 0

a a a b c b a b a a a a a 0

a a b c b a b a a a a a 0

a b c b a b a a a a a 0

b c b a b a a a a a 0

c b a b a a a a a 0

b a b a a a a a 0

a b a a a a a 0

b a a a a a 0

a a a a a 0

a a a a 0

a a a 0 a a 0 a 0 0

17 16 15 14 13 12 3 4 5 10 1 6 11 2 9 0 7 8

0 a a a a a a a a a a a b b b b b c

0 a a a a a a a b b b a a a a c b

0 a a a a a b a a c a a b b b a

0 a a a b c a a b a a a a a b

0 a b c b a a a a a a a b a

0 c b a a a b a b a a a a

b a b a b a 0 c a a a a

a b a 0 c a b a b a a

b a a a a a 0 a a a a a 0 a a a a 0 b a b a a a a a 0 a a a 0 a 0 c a a b a a a a a 0 a 0 b a b a a a a a 0 a 0 0

finished (no equal keys)

69

Constant-time string compare by indexing into inverse


original su"xes
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

index sort (rst four characters)

inverse[]
0

b a b a a a a b c b a b a a a a a 0

a b a a a a b c b a b a a a a a 0

b a a a a b c b a b a a a a a 0

a a a a b c b a b a a a a a 0

a a a b c b a b a a a a a 0

a a b c b a b a a a a a 0

a b c b a b a a a a a 0

b c b a b a a a a a 0

c b a b a a a a a 0

b a b a a a a a 0

a b a a a a a 0

b a a a a a 0

a a a a a 0

a a a a 0

a a a 0 a a 0 a 0 0

17 16 15 14 3 12 13 4 5 1 10 6

0 + 4 = 4

2 11

9 + 4 = 13

0 9 7 8

0 a a a a a a a a a a a b b b b b c

14 9 12 4 7 8 11 16 17 15 10 13 5 6 3 2 1 0

0 a a a a a a a b b b a a a a c b

0 a a a a a b a a c a a b b b a

0 a a a b c a a b a a a a a b

b a 0 c b a a a a a a a b a

c b a b a a a a a 0 0 b a a a b b a a a a a a b b a a c 0 a a a a b a c 0 a b b a a a a a a a a 0 a a a a 0 b a b a a a a a 0 a a a 0 a b a a a a a 0 a 0 c b a b a a a a a 0 0 a a 0 a 0

4 5 6 7 8 9 10 11 12 13 14 15 16 17

suffixes4[13] " suffixes4[4]

(because inverse[13] < inverse[4])


70

so suffixes8[9] " suffixes8[0]

String sorting summary


We can develop linear-time sorts.

Key compares not necessary for string keys. Use characters as index in an array.
We can develop sublinear-time sorts.

Input size is amount of data in keys (not number of keys). Not all of the data has to be examined.
3-way string quicksort is asymptotically optimal.

1.39 N lg N chars for random data.


Long strings are rarely random in practice.

Goal is often to learn the structure! May need specialized algorithms.

71

You might also like