Theory PDF
Theory PDF
Theory PDF
Strings are defined as a stream of characters. Strings are used to represent text and
are generally represented by enclosing text within quotes as: "This is a sample
string!".
Different programming languages have different ways of declaring and using Strings.
We will learn to implement strings in C/C++ and Java.
Strings in C/C++
In C/C++, Strings are defined as an array of characters. The difference between a
character array and a string is that the string is terminated with a special character ‘’.
char str_name[size];
In the above syntax, str_name is any name given to the string variable and size is
used to define the length of the string, i.e the number of characters strings will store.
Please keep in mind that there is an extra terminating character which is the Null
character ('') used to indicate the termination of string which differs strings from
normal character arrays.
Initializing a String: A string can be initialized in different ways. We will explain this
with the help of an example. Below is an example to declare a string with the name
as str and initialize it with “GeeksforGeeks”.
Printing a string array: Unlike arrays we do not need to print a string, character by
character. The C/C++ language does not provide an inbuilt data type for strings but it
has an access specifier “%s” which can be used to directly print and read strings.
#include<bits/stdc++.h>
int main()
{
// declare and initialize string
char str[] = "Geeks";
// print string
printf("%s",str);
return 0;
}
Run
Output:
Geeks
#include<bits/stdc++.h>
int main()
{
// declare and initialize string
char str[] = "GeeksforGeeks";
return 0;
}
Run
Output:
String is : GeeksforGeeks
Declaration Syntax: Declaring string using string class is simple and can be done
using the string keyword as shown below.
Sample Program:
#include<bits/stdc++.h>
using namespace std;
int main()
{
// declare and initialize string
string str = "Geeks";
// print string
cout<<str;
return 0;
}
Run
Output:
Geeks
To learn about std::string in details, refer: std::string class in C++.
Strings in Java
String is a sequence of characters. In java, objects of String are immutable which
means a constant and cannot be changed once created.
Creating a String
There are two ways to create string in Java:
• String literal
String s = “GeeksforGeeks”;
String Methods
"GeeksforGeeks".length(); // returns 13
3. String substring (int i): Return the substring from the ith index character to
end.
4. String substring (int i, int j): Returns the substring from i to j-1 index.
6. String s1 = ”Geeks”;
7. String s2 = ”forGeeks”;
9. int indexOf (String s): Returns the index within the string of the first
occurrence of the specified string.
12. int indexOf (String s, int i): Returns the index within the string of the first
occurrence of the specified string, starting at the specified index.
15. Int lastIndexOf( String s): Returns the index within the string of the last
occurrence of the specified string.
18. boolean equals( Object otherObj): Compares this string to the specified
object.
26.
35.
40. String toLowerCase(): Converts all the characters in the String to lower
case.
43. String toUpperCase(): Converts all the characters in the String to upper
case.
46. String trim(): Returns the copy of the String, by removing whitespaces at
both ends. It does not affect whitespaces in the middle.
49. String replace (char oldChar, char newChar): Returns new string by
replacing all occurrences of oldChar with newChar.
import java.io.*;
import java.util.*;
class Test
{
public static void main (String[] args)
{
String s= "GeeksforGeeks";
// or String s= new String ("GeeksforGeeks");
// Converting cases
String word1 = "GeeKyMe";
System.out.println("Changing to lower Case " +
word1.toLowerCase());
// Converting cases
String word2 = "GeekyME";
System.out.println("Changing to UPPER Case " +
word1.toUpperCase());
// Replacing characters
String str1 = "feeksforfeeks";
System.out.println("Original String " + str1);
String str2 = "feeksforfeeks".replace('f' ,'g') ;
System.out.println("Replaced f with g -> " + str2);
}
}
Run
Output :
String length = 13
Character at 3rd position = k
Substring ksforGeeks
Substring = eks
Concatenated string = GeeksforGeeks
Index of Share 6
Index of a = 8
Checking Equality false
Checking Equality true
Checking Equality false
If s1 = s2 false
Changing to lower Case geekyme
Changing to UPPER Case GEEKYME
Trim the word Learn Share Learn
Original String feeksforfeeks
Replaced f with g -> geeksgorgeeks
Naive Pattern Searching Algorithm
Given a text txt[0..n-1] and a pattern pat[0..m-1], write a function search(char pat[],
char txt[]) that prints all occurrences of pat[] in txt[]. You may assume that n > m.
Examples:
Input: txt[] = "THIS IS A TEST TEXT"
pat[] = "TEST"
Output: Pattern found at index 10
Input: txt[] = "AABAACAADAABAABA"
pat[] = "AABA"
Output: Pattern found at index 0
Pattern found at index 9
Pattern found at index 12
Naive Pattern Searching: The idea is to slide the pattern over text one by one and
check for a match. If a match is found, then slides by 1 again to check for
subsequent matches.
That is check for the match of the first character of the pattern in the string, if it
matches then check for the subsequent characters of the pattern with the respective
characters of the string. If a mismatch is found then move forward in the string.
// Driver Code
int main()
{
char txt[] = "AABAACAADAABAAABAA";
char pat[] = "AABA";
search(pat, txt);
return 0;
}
Run
Java
Output:
Pattern found at index 0
Pattern found at index 9
Pattern found at index 13
What is the best case? The best case occurs when the first character of the pattern
is not present in text at all.
txt[] = "AABCCAADDEE";
pat[] = "FAA";
Run
The number of comparisons in best case is O(n).
What is the worst case ? The worst case of Naive Pattern Searching occurs in
following scenarios.
txt[] = "AAAAAAAAAAAAAAAAAA";
pat[] = "AAAAA";
Run
2. Worst case also occurs when only the last character is different.
txt[] = "AAAAAAAAAAAAAAAAAB";
pat[] = "AAAAB";
Run
Examples:
Input: txt[] = "THIS IS A TEST TEXT"
pat[] = "TEST"
Output: Pattern found at index 10
The Naive String Matching algorithm slides the pattern one by one. After each slide,
one by one it checks characters at the current shift and if all characters match then it
prints the match.
Like the Naive Algorithm, Rabin-Karp algorithm also slides the pattern one by one.
But unlike the Naive algorithm, Rabin Karp algorithm matches the hash value of the
pattern with the hash value of current substring of text, and if the hash values match
then only it starts matching individual characters. So Rabin Karp algorithm needs to
calculate hash values for following strings.
1. Pattern itself.
2. All the substrings of the text of length m, that is of the length of pattern string.
Since we need to efficiently calculate hash values for all the substrings of size m of
text, we must have a hash function which has the following property.
Hash at the next shift must be efficiently computable from the current hash value and
next character in text or we can say hash(txt[s+1 .. s+m]) must be efficiently
computable from hash(txt[s .. s+m-1]) and txt[s+m] i.e., hash(txt[s+1 ..
s+m])= rehash(txt[s+m], hash(txt[s .. s+m-1]) and rehash must be O(1) operation.
The hash function suggested by Rabin and Karp calculates an integer value. The
integer value for a string is numeric value of a string. For example, if all possible
characters are from 1 to 10, the numeric value of "122" will be 122. The number of
possible characters is higher than 10 (256 in general) and pattern length can be
large. So the numeric values cannot be practically stored as an integer. Therefore,
the numeric value is calculated using modular arithmetic to make sure that the hash
values can be stored in an integer variable (can fit in memory words). To do
rehashing, we need to take off the most significant digit and add the new least
significant digit for in hash value. Rehashing is done using the following formula.
Where,
hash( txt[s .. s+m-1] ) : Hash value at shift s.
hash( txt[s+1 .. s+m] ) : Hash value at next shift (or shift s+1)
d: Number of characters in the alphabet
q: A prime number
h: d^(m-1)
#include<stdio.h>
#include<string.h>
Time Complexity: The average and best case running time of the Rabin-Karp
algorithm is O(n+m), but its worst-case time is O(nm). Worst case of Rabin-Karp
algorithm occurs when all characters of pattern and text are the same as the hash
values of all the substrings of txt[] match with the hash value of pat[]. For example
pat[] = "AAA" and txt[] = "AAAAAAA".
Examples:
We have discussed the Naive pattern searching algorithm and the Rabin-Karp
algorithm for searching patterns. The worst case complexity of both of the algorithms
is O(n*m). Here, we will discuss a new algorithm for searching patterns, KMP
algorithm. The time complexity of KMP algorithm is O(n) in the worst case.
KMP (Knuth Morris Pratt) Pattern Searching
The Naive pattern searching algorithm doesn’t work well in cases where we see
many matching characters followed by a mismatching character. Following are some
examples.
txt[] = "AAAAAAAAAAAAAAAAAB"
pat[] = "AAAAB"
txt[] = "ABABABCABABABCABABABC"
pat[] = "ABABAC" (not a worst case, but a bad case for Naive)
The KMP matching algorithm uses degenerating property (pattern having the same
sub-patterns appearing more than once in the pattern) of the pattern and improves
the worst case complexity to O(n). The basic idea behind KMP’s algorithm is:
whenever we detect a mismatch (after some matches), we already know some of the
characters in the text of the next window. We take advantage of this information to
avoid matching the characters that we know will anyway match. Let us consider the
below example to understand this.
Matching Overview
txt = "AAAAABAAABA"
pat = "AAAA"
Need of Preprocessing?
An important question arises from the above explanation,
how to know how many characters to be skipped. To know this,
we pre-process pattern and prepare an integer array
lps[] that tells us the count of characters to be skipped.
Preprocessing Overview:
Note : lps[i] could also be defined as longest prefix which is also proper suffix.
We need to use properly at one place to make sure that the whole substring is
not considered.
How to use lps[] to decide the next positions (or to know a number of
characters to be skipped)?
i = 0, j = 0
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++
i = 1, j = 1
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++
i = 2, j = 2
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
pat[i] and pat[j] match, do i++, j++
i = 3, j = 3
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++
i = 4, j = 4
Since j == M, print pattern found and reset j,
j = lps[j-1] = lps[3] = 3
i = 5, j = 4
Since j == M, print pattern found and reset j,
j = lps[j-1] = lps[3] = 3
i = 5, j = 2
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = lps[j-1] = lps[1] = 1
i = 5, j = 1
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = lps[j-1] = lps[0] = 0
i = 5, j = 0
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] do NOT match and j is 0, we do i++.
i = 6, j = 0
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++ and j++
i = 7, j = 1
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++ and j++
C++
#include <bits/stdc++.h>
using namespace std;
if (j == M) {
printf("Found pattern at index %d ", i - j);
j = lps[j - 1];
}
len = 0, i = 0.
lps[0] is always 0, we move
to i = 1
len = 0, i = 1.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 1, lps[1] = 1, i = 2
len = 1, i = 2.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 2, lps[2] = 2, i = 3
len = 2, i = 3.
Since pat[len] and pat[i] do not match, and len > 0,
set len = lps[len-1] = lps[1] = 1
len = 1, i = 3.
Since pat[len] and pat[i] do not match and len > 0,
len = lps[len-1] = lps[0] = 0
len = 0, i = 3.
Since pat[len] and pat[i] do not match and len = 0,
Set lps[3] = 0 and i = 4.
We know that characters pat
len = 0, i = 4.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 1, lps[4] = 1, i = 5
len = 1, i = 5.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 2, lps[5] = 2, i = 6
len = 2, i = 6.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 3, lps[6] = 3, i = 7
len = 3, i = 7.
Since pat[len] and pat[i] do not match and len > 0,
set len = lps[len-1] = lps[2] = 2
len = 2, i = 7.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 3, lps[7] = 3, i = 8