algorithm Substring Search Introduction to Rabin-Karp Algorithm

Help us to keep this website almost Ad Free! It takes only 10 seconds of your time:
> Step 1: Go view our video on YouTube: EF Core Bulk Extensions
> Step 2: And Like the video. BONUS: You can also share it!

Example

Rabin-Karp Algorithm is a string searching algorithm created by Richard M. Karp and Michael O. Rabin that uses hashing to find any one of a set of pattern strings in a text.

A substring of a string is another string that occurs in. For example, ver is a substring of stackoverflow. Not to be confused with subsequence because cover is a subsequence of the same string. In other words, any subset of consecutive letters in a string is a substring of the given string.

In Rabin-Karp algorithm, we'll generate a hash of our pattern that we are looking for & check if the rolling hash of our text matches the pattern or not. If it doesn't match, we can guarantee that the pattern doesn't exist in the text. However, if it does match, the pattern can be present in the text. Let's look at an example:

Let's say we have a text: yeminsajid and we want to find out if the pattern nsa exists in the text. To calculate the hash and rolling hash, we'll need to use a prime number. This can be any prime number. Let's take prime = 11 for this example. We'll determine hash value using this formula:

(1st letter) X (prime) + (2nd letter) X (prime)¹ + (3rd letter) X (prime)² X + ......

We'll denote:

a -> 1    g -> 7    m -> 13   s -> 19   y -> 25
b -> 2    h -> 8    n -> 14   t -> 20   z -> 26
c -> 3    i -> 9    o -> 15   u -> 21
d -> 4    j -> 10   p -> 16   v -> 22
e -> 5    k -> 11   q -> 17   w -> 23
f -> 6    l -> 12   r -> 18   x -> 24

The hash value of nsa will be:

 14 X 11⁰ + 19 X 11¹ + 1 X 11² = 344

Now we find the rolling-hash of our text. If the rolling hash matches with the hash value of our pattern, we'll check if the strings match or not. Since our pattern has 3 letters, we'll take 1st 3 letters yem from our text and calculate hash value. We get:

25 X 11⁰ + 5 X 11¹ + 13 X 11² = 1653

This value doesn't match with our pattern's hash value. So the string doesn't exists here. Now we need to consider the next step. To calculate the hash value of our next string emi. We can calculate this using our formula. But that would be rather trivial and cost us more. Instead, we use another technique.

  • We subtract the value of the First Letter of Previous String from our current hash value. In this case, y. We get, 1653 - 25 = 1628.
  • We divide the difference with our prime, which is 11 for this example. We get, 1628 / 11 = 148.
  • We add new letter X (prime)ᵐ⁻¹, where m is the length of the pattern, with the quotient, which is i = 9. We get, 148 + 9 X 11² = 1237.

The new hash value is not equal to our patterns hash value. Moving on, for n we get:

Previous String: emi
First Letter of Previous String: e(5)
New Letter: n(14)
New String: "min"
1237 - 5 = 1232
1232 / 11 = 112
112 + 14 X 11² = 1806

It doesn't match. After that, for s, we get:

Previous String: min
First Letter of Previous String: m(13)
New Letter: s(19)
New String: "ins"
1806 - 13 = 1793
1793 / 11 = 163
163 + 19 X 11² = 2462

It doesn't match. Next, for a, we get:

Previous String: ins
First Letter of Previous String: i(9)
New Letter: a(1)
New String: "nsa"
2462 - 9 = 2453
2453 / 11 = 223
223 + 1 X 11² = 344

It's a match! Now we compare our pattern with the current string. Since both the strings match, the substring exists in this string. And we return the starting position of our substring.

The pseudo-code will be:

Hash Calculation:

Procedure Calculate-Hash(String, Prime, x):
hash := 0                                  // Here x denotes the length to be considered
for m from 1 to x                          // to find the hash value
    hash := hash + (Value of String[m])ᵐ⁻¹
end for
Return hash

Hash Recalculation:

Procedure Recalculate-Hash(String, Curr, Prime, Hash):
Hash := Hash - Value of String[Curr]  //here Curr denotes First Letter of Previous String
Hash := Hash / Prime
m := String.length
New := Curr + m - 1
Hash := Hash + (Value of String[New])ᵐ⁻¹
Return Hash

String Match:

Procedure String-Match(Text, Pattern, m):
for i from m to Pattern-length + m - 1
    if Text[i] is not equal to Pattern[i]
        Return false
    end if
end for
Return true

Rabin-Karp:

Procedure Rabin-Karp(Text, Pattern, Prime):
m := Pattern.Length
HashValue := Calculate-Hash(Pattern, Prime, m)
CurrValue := Calculate-Hash(Pattern, Prime, m)
for i from 1 to Text.length - m
    if HashValue == CurrValue and String-Match(Text, Pattern, i) is true
        Return i
    end if
    CurrValue := Recalculate-Hash(String, i+1, Prime, CurrValue)
end for
Return -1

If the algorithm doesn't find any match, it simply returns -1.

This algorithm is used in detecting plagiarism. Given source material, the algorithm can rapidly search through a paper for instances of sentences from the source material, ignoring details such as case and punctuation. Because of the abundance of the sought strings, single-string searching algorithms are impractical here. Again, Knuth-Morris-Pratt algorithm or Boyer-Moore String Search algorithm is faster single pattern string searching algorithm, than Rabin-Karp. However, it is an algorithm of choice for multiple pattern search. If we want to find any of the large number, say k, fixed length patterns in a text, we can create a simple variant of the Rabin-Karp algorithm.

For text of length n and p patterns of combined length m, its average and best case running time is O(n+m) in space O(p), but its worst-case time is O(nm).



Got any algorithm Question?