String Sorts (Java)
Algorithms
FOURTH EDITION
ROBERT SEDGEWICK | KEVIN WAYNE
http://algs4.cs.princeton.edu
- LSD radix sort
- MSD radix sort
- 3-way radix quicksort
- suffix arrays
String processing
String. Sequence of characters.
Information processing. Genomic sequences. Communication systems (e.g., email). Programming systems (e.g., Java programs).
"The digital information that underlies biochemistry, cell biology, and development can be represented by a simple string of G's, A's, T's and C's. This string is the root data structure of an organism's biology." (M. V. Olson)
8-bit char data type. Supports 7-bit ASCII. Can represent only 256 characters.
[Figure: hexadecimal to ASCII conversion table]
Java char data type. A 16-bit unsigned integer. Supports original 16-bit Unicode. Supports 21-bit Unicode 3.0 (awkwardly).
Unicode characters: U+0041, U+00E1, U+2202, U+1D50A
I (heart) Unicode
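A minimal Java sketch of why 21-bit Unicode is awkward with a 16-bit char: a code point above U+FFFF, such as U+1D50A from the list above, is stored as two char values (a surrogate pair). The class and method names are hypothetical and used only for illustration.

```java
class UnicodeDemo {
    // Number of 16-bit char values used to store the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String a = new String(Character.toChars(0x0041));   // U+0041 fits in one char
        String g = new String(Character.toChars(0x1D50A));  // U+1D50A needs a surrogate pair
        System.out.println(charCount(a) + " " + codePointCount(a));  // 1 1
        System.out.println(charCount(g) + " " + codePointCount(g));  // 2 1
    }
}
```

So s.length() and s.charAt(i) count and index 16-bit units, not characters, whenever such code points are present.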
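[Figure: the string s = ATTACKATDAWN with its character indices and s.length(); the characters are stored in a char array value[] (shown as X X A T T A C K X, indices 0 to 8) together with an offset into that array]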
private String(int offset, int length, char[] value)
{
    this.offset = offset;
    this.length = length;
    this.value  = value;     // copy of reference to original char array
}

public String substring(int from, int to)
{
    return new String(offset + from, to - from, value);
}
Remark. The StringBuffer data type is similar to StringBuilder, but thread safe (and slower).
A. [quadratic time]
public static String reverse(String s)
{
    String rev = "";
    for (int i = s.length() - 1; i >= 0; i--)
        rev += s.charAt(i);
    return rev;
}

B. [linear time]
public static String reverse(String s)
{
    StringBuilder rev = new StringBuilder();
    for (int i = s.length() - 1; i >= 0; i--)
        rev.append(s.charAt(i));
    return rev.toString();
}
[Figure: the input string aacaagtttacaagc (indices 0 to 14) and its 15 suffixes]
A.
public static String[] suffixes(String s)
{
    int N = s.length();
    String[] suffixes = new String[N];
    for (int i = 0; i < N; i++)
        suffixes[i] = s.substring(i, N);
    return suffixes;
}

B.
public static String[] suffixes(String s)
{
    int N = s.length();
    StringBuilder sb = new StringBuilder(s);
    String[] suffixes = new String[N];
    for (int i = 0; i < N; i++)
        suffixes[i] = sb.substring(i, N);
    return suffixes;
}
[Figure: longest common prefix example using the string prefetch (characters indexed 0 to 7)]
public static int lcp(String s, String t)
{
    int N = Math.min(s.length(), t.length());
    for (int i = 0; i < N; i++)
        if (s.charAt(i) != t.charAt(i))
            return i;
    return N;
}

Running time. Proportional to length D of longest common prefix.
Remark. Also can compute compareTo() in sublinear time.
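To illustrate the remark, here is a sketch of a comparison that stops at the first mismatch, so it examines roughly D + 1 characters rather than both strings in full. The class name PrefixCompare and the method compare() are hypothetical; the standard String.compareTo() is the real API and may be implemented differently.

```java
class PrefixCompare {
    // Length of the longest common prefix of s and t.
    static int lcp(String s, String t) {
        int n = Math.min(s.length(), t.length());
        for (int i = 0; i < n; i++)
            if (s.charAt(i) != t.charAt(i)) return i;
        return n;
    }

    // Negative, zero, or positive as s is less than, equal to, or greater than t.
    // Only the common prefix and (at most) one more character of each string are examined.
    static int compare(String s, String t) {
        int d = lcp(s, t);
        if (d < s.length() && d < t.length())
            return s.charAt(d) - t.charAt(d);   // first mismatch decides
        return s.length() - t.length();         // one string is a prefix of the other
    }
}
```

For example, compare("prefetch", "prefix") looks at the four shared characters p, r, e, f and then decides on 'e' versus 'i'.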
Alphabets
Digital key. Sequence of digits over fixed alphabet. Radix. Number of digits R in alphabet.
name             R()      lgR()
BINARY           2        1
OCTAL            8        3
DECIMAL          10       4
HEXADECIMAL      16       4
DNA              4        2
LOWERCASE        26       5
UPPERCASE        26       5
PROTEIN          20       5
BASE64           64       6
ASCII            128      7
EXTENDED_ASCII   256      8
UNICODE16        65536    16

Standard alphabets
algorithm        guarantee        random         extra space   stable?   operations on keys
insertion sort   N^2 / 2          N^2 / 4        1             yes       compareTo()
mergesort        N lg N           N lg N         N             yes       compareTo()
quicksort        1.39 N lg N *    1.39 N lg N    c lg N        no        compareTo()
heapsort         2 N lg N         2 N lg N       1             no        compareTo()

* probabilistic

Lower bound. ~ N lg N compares required by any compare-based algorithm.
Q. Can we do better (despite the lower bound)?
A. Yes, if we don't depend on key compares.
Sort string by first letter. Sort class roster by section. Sort phone numbers by area code. Subroutine in a sorting algorithm. [stay tuned]
Remark. Keys may have associated data, so we can't just count up the number of keys of each value.
[Figure: a class roster as input, and the sorted result ordered by section (1 to 4)]
Count frequencies of each letter using key as index.
Compute frequency cumulates which specify destinations.
Access cumulates using key as index to move items.
Copy back into original array.

int N = a.length;
int[] count = new int[R+1];

for (int i = 0; i < N; i++)        // count frequencies
    count[a[i]+1]++;

for (int r = 0; r < R; r++)        // compute cumulates
    count[r+1] += count[r];

for (int i = 0; i < N; i++)        // move items
    aux[count[a[i]]++] = a[i];

for (int i = 0; i < N; i++)        // copy back
    a[i] = aux[i];

Example with R = 6 (use a for 0, b for 1, c for 2, d for 3, e for 4, f for 5).

i        0  1  2  3  4  5  6  7  8  9 10 11
a[i]     d  a  c  f  f  b  d  b  f  b  e  a

r                                  a  b  c  d  e  f  -
count[r] after count frequencies   0  2  3  1  2  1  3
count[r] after compute cumulates   0  2  5  6  8  9 12
count[r] after move items          2  5  6  8  9 12 12

i        0  1  2  3  4  5  6  7  8  9 10 11
aux[i]   a  a  b  b  b  c  d  d  e  f  f  f     (after move items)
a[i]     a  a  b  b  b  c  d  d  e  f  f  f     (after copy back)
input (name, section)      sorted result (by section)
Anderson   2               aux[0]   Harris     1
Brown      3               aux[1]   Martin     1
Davis      3               aux[2]   Moore      1
Garcia     4               aux[3]   Anderson   2
Harris     1               aux[4]   Martinez   2
Jackson    3               aux[5]   Miller     2
Johnson    4               aux[6]   Robinson   2
Jones      3               aux[7]   White      2
Martin     1               aux[8]   Brown      3
Martinez   2               aux[9]   Davis      3
Miller     2               aux[10]  Jackson    3
Moore      1               aux[11]  Jones      3
Robinson   2               aux[12]  Taylor     3
Smith      4               aux[13]  Williams   3
Taylor     3               aux[14]  Garcia     4
Thomas     4               aux[15]  Johnson    4
Thompson   4               aux[16]  Smith      4
White      2               aux[17]  Thomas     4
Williams   3               aux[18]  Thompson   4
Wilson     4               aux[19]  Wilson     4
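The four steps of key-indexed counting can be sketched for keys that carry associated data, here a roster of names keyed by section. This is an illustrative sketch, not the authors' code: the class name, the parallel-array layout, and the assumption that section numbers lie between 0 and R - 1 are choices made for the example.

```java
class KeyIndexedCounting {
    // Rearranges names[] and sections[] so that sections are in ascending order.
    // Assumes every key in sections[] is an integer between 0 and R - 1.
    // Items with equal keys keep their relative order (the sort is stable).
    static void sort(String[] names, int[] sections, int R) {
        int n = names.length;
        String[] auxNames = new String[n];
        int[] auxKeys = new int[n];
        int[] count = new int[R + 1];

        for (int i = 0; i < n; i++)          // count frequencies
            count[sections[i] + 1]++;
        for (int r = 0; r < R; r++)          // compute cumulates
            count[r + 1] += count[r];
        for (int i = 0; i < n; i++) {        // move items (key and associated data together)
            int dest = count[sections[i]]++;
            auxNames[dest] = names[i];
            auxKeys[dest] = sections[i];
        }
        for (int i = 0; i < n; i++) {        // copy back
            names[i] = auxNames[i];
            sections[i] = auxKeys[i];
        }
    }
}
```

Because items are moved in their original left-to-right order, names within one section stay in alphabetical order when the input is sorted by name.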
Consider characters from right to left.
Stably sort using the dth character as the key (using key-indexed counting).

 i    input    sort key (d = 2)    sort key (d = 1)    sort key (d = 0)
 0    dab      dab                 dab                 ace
 1    add      cab                 cab                 add
 2    cab      ebb                 fad                 bad
 3    fad      add                 bad                 bed
 4    fee      fad                 dad                 bee
 5    bad      bad                 ebb                 cab
 6    dad      dad                 ace                 dab
 7    bee      fed                 add                 dad
 8    fed      bed                 fed                 ebb
 9    bed      fee                 bed                 fad
10    ebb      bee                 fee                 fed
11    ace      ace                 bee                 fee

Pf. [by induction on i] After pass i, strings are sorted by last i characters.
- If two strings differ on the sort key, key-indexed counting puts them in proper relative order.
- If two strings agree on the sort key, stability keeps them in proper relative order.
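A minimal sketch of LSD string sort built from the key-indexed counting code shown earlier. It assumes all strings have the same length w and use only characters below 256 (extended ASCII); the class name LsdSketch is hypothetical.

```java
class LsdSketch {
    // Sorts a[], whose strings all have length w, by running one stable
    // key-indexed counting pass per character position, from right to left.
    static void sort(String[] a, int w) {
        int R = 256;                        // radix: extended ASCII (assumption)
        int n = a.length;
        String[] aux = new String[n];

        for (int d = w - 1; d >= 0; d--) {
            int[] count = new int[R + 1];
            for (int i = 0; i < n; i++)     // count frequencies of the dth character
                count[a[i].charAt(d) + 1]++;
            for (int r = 0; r < R; r++)     // compute cumulates
                count[r + 1] += count[r];
            for (int i = 0; i < n; i++)     // move items (stable)
                aux[count[a[i].charAt(d)]++] = a[i];
            for (int i = 0; i < n; i++)     // copy back
                a[i] = aux[i];
        }
    }
}
```

Each pass touches every string once, so the work grows with W times N rather than with N lg N compares.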
algorithm        guarantee        random         extra space   stable?   operations on keys
insertion sort   N^2 / 2          N^2 / 4        1             yes       compareTo()
mergesort        N lg N           N lg N         N             yes       compareTo()
quicksort        1.39 N lg N *    1.39 N lg N    c lg N        no        compareTo()
heapsort         2 N lg N         2 N lg N       1             no        compareTo()
LSD              2 W N            2 W N          N + R         yes       charAt()

* probabilistic
Use punch cards to record data (e.g., gender, age). Machine sorts one column at a time (into one of 12 bins). Typical question: how many women of age 20 to 30?
Also useful for accounting, inventory, and business processes. Primary medium for data entry, storage, and processing.
Hollerith's company later merged with 3 others to form Computing Tabulating Recording Corporation (CTRC); the company was renamed IBM in 1924.
[Figure: card punch, punched cards, card reader, mainframe, line printer, card sorter]
To sort a card deck
- start on right column
- put cards into hopper
- machine distributes into bins
- pick up cards (stable)
- move left one column
- continue until sorted
[Figure: MSD string sort overview. The strings dab, add, cab, fad, fee, bad, dad, bee, fed, bed, ebb, ace are sorted on their first character (the sort key) by key-indexed counting, giving ace add bad bed bee cab dab dad ebb fad fed fee order by first letter with count[] = 0 2 5 6 8 9 12; the subarrays that share a first character (for example indices 2 to 4 for b) are then sorted on the following characters.]

input:   she sells seashells by the sea shore the shells she sells are surely seashells
output:  are by sea seashells seashells sells sells she she shells shore surely the the

Annotations in the trace: need to examine every character in equal keys; end-of-string goes before any char value.
Trace of recursive calls for MSD string sort (no cutoff for small subarrays, subarrays of size 0 and 1 omitted)
Variable-length strings
Treat strings as if they had an extra char at end (smaller than any char).
[Figure: the strings sea, seashells, sells, she, she, shells, shore, surely written character by character, with -1 in the position after each string's last character. Why smaller? So that she sorts before shells.]
private static int charAt(String s, int d)
{
    if (d < s.length()) return s.charAt(d);
    else return -1;
}

private static void sort(String[] a, String[] aux, int lo, int hi, int d)
{
    if (hi <= lo) return;

    // key-indexed counting
    int[] count = new int[R+2];
    for (int i = lo; i <= hi; i++)
        count[charAt(a[i], d) + 2]++;
    for (int r = 0; r < R+1; r++)
        count[r+1] += count[r];
    for (int i = lo; i <= hi; i++)
        aux[count[charAt(a[i], d) + 1]++] = a[i];
    for (int i = lo; i <= hi; i++)
        a[i] = aux[i - lo];

    // sort R subarrays recursively
    for (int r = 0; r < R; r++)
        sort(a, aux, lo + count[r], lo + count[r+1] - 1, d+1);
}
Observation 1. Much too slow for small subarrays.
- Each function call needs its own count[] array.
- ASCII (256 counts): 100x slower than copy pass for N = 2.
- Unicode (65,536 counts): 32,000x slower for N = 2.

Observation 2. Huge number of small subarrays because of recursion.

[Figure: a two-element subarray a[] = b a (indices 0 and 1) and aux[] = a b]
Insertion sort, but start at dth character.
Implement less() so that it compares starting at dth character.

public static void sort(String[] a, int lo, int hi, int d)
{
    for (int i = lo; i <= hi; i++)
        for (int j = i; j > lo && less(a[j], a[j-1], d); j--)
            exch(a, j, j-1);
}

private static boolean less(String v, String w, int d)
{
    return v.substring(d).compareTo(w.substring(d)) < 0;
}

Note. In Java versions where substring() takes constant time (before Java 7 update 6), forming and comparing substrings is faster than directly comparing chars with charAt().
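Putting the pieces together, a self-contained sketch of MSD string sort with a cutoff to insertion sort might look as follows. The radix R = 256, the cutoff value 15, and the class name MsdSketch are assumptions made for this example. The less() helper here compares characters directly instead of forming substrings, which avoids depending on how substring() is implemented.

```java
class MsdSketch {
    private static final int R = 256;       // extended ASCII (assumption)
    private static final int CUTOFF = 15;   // illustrative cutoff to insertion sort

    static void sort(String[] a) {
        String[] aux = new String[a.length];
        sort(a, aux, 0, a.length - 1, 0);
    }

    // Character d of s, or -1 if s has no such character (end of string).
    private static int charAt(String s, int d) {
        return d < s.length() ? s.charAt(d) : -1;
    }

    private static void sort(String[] a, String[] aux, int lo, int hi, int d) {
        if (hi <= lo + CUTOFF) { insertion(a, lo, hi, d); return; }

        int[] count = new int[R + 2];       // key-indexed counting on character d
        for (int i = lo; i <= hi; i++) count[charAt(a[i], d) + 2]++;
        for (int r = 0; r < R + 1; r++) count[r + 1] += count[r];
        for (int i = lo; i <= hi; i++) aux[count[charAt(a[i], d) + 1]++] = a[i];
        for (int i = lo; i <= hi; i++) a[i] = aux[i - lo];

        for (int r = 0; r < R; r++)         // sort R subarrays recursively
            sort(a, aux, lo + count[r], lo + count[r + 1] - 1, d + 1);
    }

    // Insertion sort of a[lo..hi], whose strings all agree on their first d characters.
    private static void insertion(String[] a, int lo, int hi, int d) {
        for (int i = lo; i <= hi; i++)
            for (int j = i; j > lo && less(a[j], a[j - 1], d); j--) {
                String t = a[j]; a[j] = a[j - 1]; a[j - 1] = t;
            }
    }

    // Is v less than w, comparing from character d onward?
    private static boolean less(String v, String w, int d) {
        int n = Math.min(v.length(), w.length());
        for (int i = d; i < n; i++)
            if (v.charAt(i) != w.charAt(i)) return v.charAt(i) < w.charAt(i);
        return v.length() < w.length();
    }
}
```

With the cutoff in place, a count[] array is allocated only for subarrays of more than CUTOFF + 1 strings, which addresses the small-subarray overhead.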
MSD examines just enough characters to sort the keys. Number of characters examined depends on keys. Can be sublinear in input size!
compareTo() based sorts can also be sublinear!
Random (sublinear)    Non-random with duplicates (nearly linear)    Worst case (linear)
1EIO402               are                                           1DNB377
1HYL490               by                                            1DNB377
1ROZ572               sea                                           1DNB377
2HXE734               seashells                                     1DNB377
2IYE230               seashells                                     1DNB377
2XOR846               sells                                         1DNB377
3CDB573               sells                                         1DNB377
3CVP720               she                                           1DNB377
3IGJ319               she                                           1DNB377
3KNA382               shells                                        1DNB377
3TAV879               shore                                         1DNB377
4CQP781               surely                                        1DNB377
4QGI284               the                                           1DNB377
4YHV229               the                                           1DNB377
algorithm        guarantee        random         extra space   stable?   operations on keys
insertion sort   N^2 / 2          N^2 / 4        1             yes       compareTo()
mergesort        N lg N           N lg N         N             yes       compareTo()
quicksort        1.39 N lg N *    1.39 N lg N    c lg N        no        compareTo()
heapsort         2 N lg N         2 N lg N       1             no        compareTo()
LSD              2 W N            2 W N          N + R         yes       charAt()
MSD              2 W N            N log_R N      N + D R       yes       charAt()

* probabilistic
Disadvantages of MSD string sort.
- Extra space for aux[].
- Extra space for count[].
- Inner loop has a lot of instructions.
- Accesses memory "randomly" (cache inefficient).

Disadvantage of quicksort.
- Linearithmic number of string compares (not linear).
- Has to rescan many characters in keys with long prefix matches.
Less overhead than R-way partitioning in MSD string sort.
Does not re-examine characters equal to the partitioning char (but does re-examine characters not equal to the partitioning char).

[Figure: use first character of the partitioning item to partition into "less", "equal", and "greater" subarrays]

input:               she sells seashells by the sea shore the shells she sells are surely seashells
after partitioning:  by are seashells she seashells sea shore surely shells she sells sells the the

Trace of first few recursive calls for 3-way string quicksort (subarrays of size 1 not shown):

she sells seashells by the sea shore the shells she sells are surely seashells
by are seashells she seashells sea shore surely shells she sells sells the the
are by seashells she seashells sea shore surely shells she sells sells the the
are by seashells sells seashells sea sells shells she surely shore she the the
are by seashells sea seashells sells sells shells she surely shore she the the
private static void sort(String[] a)
{  sort(a, 0, a.length - 1, 0);  }

private static void sort(String[] a, int lo, int hi, int d)
{
    if (hi <= lo) return;

    // 3-way partitioning (using dth character)
    int lt = lo, gt = hi;
    int v = charAt(a[lo], d);      // charAt() returns -1 at end of string, to handle variable-length strings
    int i = lo + 1;
    while (i <= gt)
    {
        int t = charAt(a[i], d);
        if      (t < v) exch(a, lt++, i++);
        else if (t > v) exch(a, i, gt--);
        else            i++;
    }

    sort(a, lo, lt-1, d);
    if (v >= 0) sort(a, lt, gt, d+1);
    sort(a, gt+1, hi, d);
}
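The sort() method above relies on two helpers, charAt() and exch(). A self-contained sketch with those helpers filled in (the class name Quick3Sketch and the main() driver are hypothetical additions for this example) could look like this:

```java
class Quick3Sketch {
    static void sort(String[] a) {
        sort(a, 0, a.length - 1, 0);
    }

    // Character d of s, or -1 at end of string, so shorter strings sort first.
    private static int charAt(String s, int d) {
        return d < s.length() ? s.charAt(d) : -1;
    }

    private static void exch(String[] a, int i, int j) {
        String t = a[i]; a[i] = a[j]; a[j] = t;
    }

    private static void sort(String[] a, int lo, int hi, int d) {
        if (hi <= lo) return;
        int lt = lo, gt = hi;
        int v = charAt(a[lo], d);            // partitioning character
        int i = lo + 1;
        while (i <= gt) {
            int t = charAt(a[i], d);
            if      (t < v) exch(a, lt++, i++);
            else if (t > v) exch(a, i, gt--);
            else            i++;
        }
        sort(a, lo, lt - 1, d);              // "less" subarray: same character position
        if (v >= 0) sort(a, lt, gt, d + 1);  // "equal" subarray: move to next character
        sort(a, gt + 1, hi, d);              // "greater" subarray: same character position
    }

    public static void main(String[] args) {
        String[] a = "she sells seashells by the sea shore the shells she sells are surely seashells".split(" ");
        sort(a);
        System.out.println(String.join(" ", a));
    }
}
```

Only the "equal" subarray advances to character d + 1, which is why characters equal to the partitioning character are never re-examined.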
Standard quicksort.
- Uses ~ 2 N ln N string compares on average.
- Costly for keys with long common prefixes (and this is a common case!)

3-way string (radix) quicksort.
- Uses ~ 2 N ln N character compares on average for random strings.
- Avoids re-comparing long common prefixes.
[Figure: first page of a paper on sorting and searching strings. Legible excerpts follow.]

Abstract
We present theoretical algorithms for sorting and searching multikey data, and derive from them practical C implementations for applications in which keys are character strings. The sorting algorithm blends Quicksort and radix sort; it is competitive with the best known C sort codes. The searching algorithm blends tries and binary search trees; it is faster than hashing and other commonly used search methods. The basic ideas behind the algorithms date back at least to the 1960s but their practical utility has been overlooked. We also present extensions to more complex string problems, such as partial-match searching.

1. Introduction
[...] that is competitive with the most efficient string sorting programs known. The second program is a symbol table implementation that is faster than hashing, which is commonly regarded as the fastest symbol table implementation. The symbol table implementation is much more space-efficient than multiway trees, and supports more advanced searches. In many application programs, sorts use a Quicksort implementation based on an abstract compare operation, and searches use hashing or binary search trees. These do not take advantage of the properties of string keys, which are widely used in practice. Our algorithms provide a natural and elegant way to adapt classical algorithms to this important class of applications. Section 6 turns to more difficult string-searching problems. Partial-match queries allow "don't care" characters [...]
Section 2 briefly reviews Hoare's [9] Quicksort and [...]
MSD string sort.
- Is cache-inefficient.
- Too much memory storing count[].
- Too much overhead reinitializing count[] and aux[].

[Figure: library of Congress call numbers]

Bottom line. 3-way string quicksort is the method of choice for sorting strings.
algorithm                 guarantee         random         extra space   stable?   operations on keys
insertion sort            N^2 / 2           N^2 / 4        1             yes       compareTo()
mergesort                 N lg N            N lg N         N             yes       compareTo()
quicksort                 1.39 N lg N *     1.39 N lg N    c lg N        no        compareTo()
heapsort                  2 N lg N          2 N lg N       1             no        compareTo()
LSD                       2 W N             2 W N          N + R         yes       charAt()
MSD                       2 W N             N log_R N      N + D R       yes       charAt()
3-way string quicksort    1.39 W N lg N     1.39 N lg N    log N + W     no        charAt()

* probabilistic
Keyword-in-context search
Given a text of N characters, preprocess it to enable fast substring search (find all occurrences of the query string, with surrounding context).

% more tale.txt
it was the best of times
it was the worst of times
it was the age of wisdom
it was the age of foolishness
it was the epoch of belief
it was the epoch of incredulity
it was the season of light
it was the season of darkness
it was the spring of hope
it was the winter of despair
...

% java KWIC tale.txt 15          (15 = characters of surrounding context)
search
o st giless to search for contraband
her unavailing search for your fathe
le and gone in search of her husband
t provinces in search of impoverishe
 dispersing in search of other carri
n that bed and search the straw hold

better thing
t is a far far better thing that i do than
 some sense of better things else forgotte
was capable of better things mr carton ent
Suffix sort

input string: i t w a s b e s t i t w a s w   (indices 0 to 14)

form suffixes               sorted suffixes
 0  itwasbestitwasw          3  asbestitwasw
 1  twasbestitwasw          12  asw
 2  wasbestitwasw            5  bestitwasw
 3  asbestitwasw             6  estitwasw
 4  sbestitwasw              0  itwasbestitwasw
 5  bestitwasw               9  itwasw
 6  estitwasw                4  sbestitwasw
 7  stitwasw                 7  stitwasw
 8  titwasw                 13  sw
 9  itwasw                   8  titwasw
10  twasw                    1  twasbestitwasw
11  wasw                    10  twasw
12  asw                     14  w
13  sw                       2  wasbestitwasw
14  w                       11  wasw
Preprocess: suffix sort the text.
Query: binary search for query; scan until mismatch.

KWIC search for "search" in Tale of Two Cities (suffix start index, first characters of suffix):

   ...
632698  sealed_my_letter_and_
713727  seamstress_is_lifted_
660598  seamstress_of_twenty_
 67610  seamstress_who_was_wi
  4430  search_for_contraband
 42705  search_for_your_fathe
499797  search_of_her_husband
182045  search_of_impoverishe
143399  search_of_other_carri
411801  search_the_straw_hold
158410  seared_marking_about_
691536  seas_and_madame_defar
536569  sease_a_terrible_pass
484763  sease_that_had_brough
   ...
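The preprocess and query steps can be sketched as follows. This is an illustrative sketch rather than the KWIC program used in the session above: the class name KwicSketch, its methods, and the use of a general-purpose comparison sort on suffix start positions (instead of a specialized string sort) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class KwicSketch {
    private final String text;
    private final Integer[] index;   // index[i] = start position of the ith smallest suffix

    // Preprocess: suffix sort the text (here by sorting start positions).
    KwicSketch(String text) {
        this.text = text;
        index = new Integer[text.length()];
        for (int i = 0; i < index.length; i++) index[i] = i;
        Arrays.sort(index, (p, q) -> compareSuffixes(p, q));
    }

    private int compareSuffixes(int p, int q) {
        int n = text.length();
        while (p < n && q < n) {
            char a = text.charAt(p++), b = text.charAt(q++);
            if (a != b) return a - b;
        }
        return q - p;   // the suffix that ran out first is the smaller one
    }

    // Compares the suffix starting at p with the query, looking only at query.length() characters.
    private int compareToQuery(int p, String query) {
        for (int i = 0; i < query.length(); i++) {
            if (p + i >= text.length()) return -1;   // suffix is a proper prefix of the query
            int diff = text.charAt(p + i) - query.charAt(i);
            if (diff != 0) return diff;
        }
        return 0;                                    // suffix starts with the query
    }

    // Query: binary search for the first suffix >= query, then scan until mismatch.
    List<Integer> search(String query) {
        int lo = 0, hi = index.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (compareToQuery(index[mid], query) < 0) lo = mid + 1;
            else hi = mid;
        }
        List<Integer> hits = new ArrayList<>();
        for (int i = lo; i < index.length && compareToQuery(index[i], query) == 0; i++)
            hits.add(index[i]);
        return hits;
    }

    // The occurrence at pos together with up to `context` characters on each side.
    String context(int pos, int queryLength, int context) {
        int from = Math.max(0, pos - context);
        int to = Math.min(text.length(), pos + queryLength + context);
        return text.substring(from, to);
    }
}
```

All suffixes that begin with the query are adjacent in sorted order, so one binary search followed by a short scan finds every occurrence.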
[Figure: a long DNA string over the characters a, c, g, t (beginning agccaatcag agctcgaaaa catgcagccc ...)]
http://www.bewitched.com
Brute-force algorithm.
Try all indices i and j for start of possible match. Compute longest common prefix (LCP) for each pair.
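A direct sketch of this brute-force approach, using the example string aacaagtttacaagc from these slides; the class name BruteLrs is hypothetical.

```java
class BruteLrs {
    // Longest substring that occurs at two different starting positions of s.
    static String lrs(String s) {
        int n = s.length();
        String best = "";
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                // longest common prefix of the suffixes starting at i and j
                int len = 0;
                while (j + len < n && s.charAt(i + len) == s.charAt(j + len)) len++;
                if (len > best.length()) best = s.substring(i, i + len);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(lrs("aacaagtttacaagc"));   // acaag
    }
}
```

The two nested loops visit about N^2 / 2 pairs, and each pair can cost up to D character compares, which is what makes this approach impractical for long texts.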
[Figure: the string a a c a a g t t t a c a a g c (indices 0 to 14) with two candidate start indices i and j]

form suffixes               sorted suffixes
 0  aacaagtttacaagc          0  aacaagtttacaagc
 1  acaagtttacaagc          11  aagc
 2  caagtttacaagc            3  aagtttacaagc
 3  aagtttacaagc             9  acaagc
 4  agtttacaagc              1  acaagtttacaagc
 5  gtttacaagc              12  agc
 6  tttacaagc                4  agtttacaagc
 7  ttacaagc                14  c
 8  tacaagc                 10  caagc
 9  acaagc                   2  caagtttacaagc
10  caagc                   13  gc
11  aagc                     5  gtttacaagc
12  agc                      8  tacaagc
13  gc                       7  ttacaagc
14  c                        6  tttacaagc
- create suffixes (linear time and space)
- sort suffixes
- find LCP between adjacent suffixes in sorted order

% java LRS < mobydick.txt
,- Such a funny, sporty, gamy, jesty, joky, hoky-poky lad, is the Ocean, oh! Th
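The three steps can be sketched as below. This is not the LRS program run above: the class name SuffixLrs is hypothetical, and the system sort Arrays.sort() stands in for a specialized string sort. Whether creating the suffixes takes linear time and space depends on how substring() is implemented in the Java version at hand; the sketch is correct either way.

```java
import java.util.Arrays;

class SuffixLrs {
    // Length of the longest common prefix of s and t.
    static int lcp(String s, String t) {
        int n = Math.min(s.length(), t.length());
        for (int i = 0; i < n; i++)
            if (s.charAt(i) != t.charAt(i)) return i;
        return n;
    }

    static String lrs(String s) {
        int n = s.length();

        String[] suffixes = new String[n];            // create suffixes
        for (int i = 0; i < n; i++)
            suffixes[i] = s.substring(i, n);

        Arrays.sort(suffixes);                        // sort suffixes

        String best = "";                             // find LCP between adjacent suffixes
        for (int i = 0; i < n - 1; i++) {
            int len = lcp(suffixes[i], suffixes[i + 1]);
            if (len > best.length())
                best = suffixes[i].substring(0, len);
        }
        return best;
    }
}
```

Sorting brings suffixes with a long shared prefix next to each other, so only N - 1 adjacent pairs need an LCP computation instead of all pairs.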
Sorting challenge
Problem. Five scientists A, B, C, D, and E are looking for a long repeated substring in a genome with over 1 billion nucleotides.
- A has a grad student do it by hand.
- B uses brute force (check all pairs).
- C uses suffix sorting solution with insertion sort.
- D uses suffix sorting solution with LSD string sort.
- E uses suffix sorting solution with 3-way string quicksort. ✓ (but only if LRS is not long (!))
characters     brute        suffix sort
2,162          0.6 sec      0.14 sec
18,369         37 sec       0.25 sec
191,945        1.2 hours    1.0 sec
1.2 million    43 hours     7.6 sec
7.1 million    2 months     61 sec
10 million     4 months     84 sec
20 million     forever      ???

(Some entries are marked as estimated in the original figure.)
Ex: same letter repeated N times.
Ex: two copies of the same Java codebase.

form suffixes          sorted suffixes
0  twinstwins          7  ins
1  winstwins           2  instwins
2  instwins            8  ns
3  nstwins             3  nstwins
4  stwins              9  s
5  twins               4  stwins
6  wins                5  twins
7  ins                 0  twinstwins
8  ns                  6  wins
9  s                   1  winstwins

LRS needs at least 1 + 2 + 3 + ... + D character compares, where D = length of longest match.
Running time. Quadratic (or worse) in D for LRS (and also for sort).
Phase 0: sort on first character using key-indexed counting sort.
Phase i: given array of suffixes sorted on first 2^(i-1) characters, create array of suffixes sorted on first 2^i characters. [ahead]
Example. Suffixes of the string b a b a a a a b c b a b a a a a a 0 (indices 0 to 17, terminated by 0), listed by start index in sorted order after each phase:

sorted on first 1 character:    17  1 16  3  4  5  6 15 14 13 12 10  0  9 11  7  2  8
sorted on first 2 characters:   17 16 12  3  4  5 13 15 14  6  1 10  0  9 11  2  7  8
sorted on first 4 characters:   17 16 15 14  3 12 13  4  5  1 10  6  2 11  0  9  7  8
sorted on first 8 characters:   17 16 15 14 13 12  3  4  5 10  1  6 11  2  9  0  7  8
inverse[]. With the suffixes sorted on their first 4 characters,

17 16 15 14 3 12 13 4 5 1 10 6 2 11 0 9 7 8

inverse[i] gives the position of suffix i in that order:

i             0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
inverse[i]   14  9 12  4  7  8 11 16 17 15 10 13  5  6  3  2  1  0

Example. Suffixes 0 and 9 agree on their first 4 characters, so compare the suffixes that start 4 characters later: 0 + 4 = 4 and 9 + 4 = 13. Since inverse[13] = 6 is less than inverse[4] = 7, suffix 9 comes before suffix 0 when sorting on the first 8 characters. Each such comparison takes constant time.
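The doubling idea can be sketched in a few lines if each phase is allowed to use a general-purpose comparison sort. That is a deliberate simplification: it does not perform a phase in linear time, so this sketch is slower than the algorithm described above. The class name PrefixDoublingSketch is hypothetical, and rank[] plays the role of inverse[], with tied suffixes sharing a rank.

```java
import java.util.Arrays;

class PrefixDoublingSketch {
    // Returns the start positions of the suffixes of s in sorted order.
    static int[] suffixArray(String s) {
        int n = s.length();
        Integer[] index = new Integer[n];
        int[] rank = new int[n];
        for (int i = 0; i < n; i++) {
            index[i] = i;
            rank[i] = s.charAt(i);           // rank on the first character
        }

        for (int h = 1; ; h *= 2) {
            final int[] r = rank;
            final int step = h;
            // Given ranks on the first h characters, order suffixes on the first 2h characters.
            Arrays.sort(index, (p, q) -> compare(r, p, q, step));

            // Recompute ranks; suffixes that tie on the first 2h characters share a rank.
            int[] next = new int[n];
            for (int i = 1; i < n; i++) {
                boolean differs = compare(r, index[i - 1], index[i], step) < 0;
                next[index[i]] = next[index[i - 1]] + (differs ? 1 : 0);
            }
            rank = next;
            if (n == 0 || rank[index[n - 1]] == n - 1) break;   // all ranks distinct: done
        }

        int[] result = new int[n];
        for (int i = 0; i < n; i++) result[i] = index[i];
        return result;
    }

    // Constant-time comparison of suffixes p and q on their first 2h characters,
    // using ranks on the first h characters; -1 stands for "past the end of the string".
    private static int compare(int[] rank, int p, int q, int h) {
        if (rank[p] != rank[q]) return Integer.compare(rank[p], rank[q]);
        int rp = p + h < rank.length ? rank[p + h] : -1;
        int rq = q + h < rank.length ? rank[q + h] : -1;
        return Integer.compare(rp, rq);
    }
}
```

The number of characters covered doubles in every phase, so about lg N phases suffice, and no comparison ever scans a long common prefix character by character.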
Key compares not necessary for string keys. Use characters as index in an array.
We can develop sublinear-time sorts.
Input size is amount of data in keys (not number of keys). Not all of the data has to be examined.
3-way string quicksort is asymptotically optimal.