VOOZH about

URL: https://www.geeksforgeeks.org/dsa/boyer-moore-algorithm-good-suffix-heuristic/

โ‡ฑ Boyer Moore Algorithm | Good Suffix heuristic - GeeksforGeeks


  • Courses
  • Tutorials
  • Interview Prep

Boyer Moore Algorithm | Good Suffix heuristic

Last Updated : 12 Apr, 2025

Before diving into the Good Suffix Heuristic, it is recommended to first read about the Boyer-Moore Pattern Searching Algorithm and the Bad Character Heuristic to gain a clear understanding of how this algorithm optimizes pattern searching.

Refer Boyer Moore Algorithm for Pattern Searching for clear understanding of Boyer Moore Algo.

One important part of the Boyer-Moore algorithm is the Good Suffix Heuristic. When the algorithm encounters a mismatch, this heuristic helps by shifting the pattern to align with another occurrence of a similar part (suffix) of the pattern that has already matched. This reduces the number of checks needed, making the search faster.

Good Suffix Heuristic for Pattern Searching

Just like the Bad Character Heuristic, the Good Suffix Heuristic also involves a preprocessing step where a table is generated to optimize the pattern matching process. The Strong Good Suffix Heuristic is an important optimization in the Boyer-Moore algorithm for string pattern matching. It helps to skip unnecessary comparisons and efficiently shift the pattern when a mismatch occurs. Letโ€™s break down how it works and its associated preprocessing.

Let t be a substring of the text T that matches a substring of the pattern P. When a mismatch occurs, the pattern is shifted based on the following rules:

  1. Another occurrence of t in P
  2. Prefix of P matches the suffix of t
  3. Move the pattern past t

This heuristic helps optimize the search by allowing the pattern to skip over sections of the text that have already been matched, leading to faster string matching.

Case 1: Another occurrence of t in P matched with t in T

If there are other occurrences of the substring t in the pattern P, the pattern is shifted so that one of these occurrences aligns with the substring t in the text T. This allows the pattern to continue matching efficiently. For example-

๐Ÿ‘ Image

Explanation:

In the above example, we have a substring t of text T that matches with pattern P (in green) before the mismatch occurs at index 2. Now, we search for occurrences of t ("AB") within P. We find an occurrence starting at position 1 (highlighted with a yellow background). Therefore, we right shift the pattern by 2 positions to align the occurrence of t in P with t in T.

This is the weak rule of the original Boyer-Moore algorithm, which is not very effective. We will discuss the Strong Good Suffix rule shortly, which provides a more efficient approach.

Case 2: A prefix of P, which matches with suffix of t in T

It is not always guaranteed that we will find an occurrence of t in P. In some cases, there may be no exact occurrence at all. In such situations, we can instead look for a suffix of t that matches with a prefix of P. If a match is found, we can shift the pattern P to align the matched suffix of t with the prefix of P, allowing us to continue the search effectively. This method helps when direct matches for t are not found, and it ensures that the search can still progress efficiently. For example -

๐Ÿ‘ Image

Explanation:

In above example, we have got t (โ€œBABโ€) matched with P (in green) at index 2-4 before mismatch . But because there exists no occurrence of t in P we will search for some prefix of P which matches with some suffix of t. We have found prefix โ€œABโ€ (in the yellow background) starting at index 0 which matches not with whole t but the suffix of t โ€œABโ€ starting at index 3. So now we will shift pattern 3 times to align prefix with the suffix.

Case 3: P moves past t

If the above two cases are not satisfied, we will shift the pattern past the t. For example -

๐Ÿ‘ Image

Explanation:

If above example, there exist no occurrence of t (โ€œABโ€) in P and also there is no prefix in P which matches with the suffix of t. So, in that case, we can never find any perfect match before index 4, so we will shift the P past the t ie. to index 5.

Suppose the substring q = P[i to n] of the pattern P has matched with the substring t in the text T, and the character c = P[i-1] is the mismatching character.

Unlike the weak rule of the Good Suffix Heuristic, we now search for the substring t in P where t is not preceded by the character c. This means we look for occurrences of t in the pattern P that are not immediately preceded by the mismatching character c.

The closest occurrence of t in P, which is not preceded by c, is then aligned with t in T. The pattern P is shifted accordingly to align this occurrence of t in the pattern with the matched substring t in the text. This strong rule of the Good Suffix Heuristic enhances the efficiency of the Boyer-Moore algorithm by ensuring that the pattern is shifted more effectively, reducing unnecessary comparisons and speeding up the search process.

For example -

๐Ÿ‘ Image

Explanation:

Consider the scenario where a substring q = P[i to n] of the pattern P has matched with a substring t in the text T, and the mismatching character c = P[i-1] is found at position P[i-1]. Instead of searching for the occurrence of t in the pattern P that is preceded by the mismatching character c, we focus on finding the occurrence of t in Pnot preceded by c.

  • Step 1: Search for t in the pattern P. If the first occurrence of t is found at position 4, but this occurrence is preceded by the mismatching character c = "C", we skip it and continue searching.
  • Step 2: At position 1, we find another occurrence of t in P. This occurrence is preceded by the character A, which is not equal toc. Therefore, we shift the pattern by 6 positions to align this occurrence with the matched substring t in the text T.

This method helps avoid unnecessary comparisons and ensures that the pattern shifts efficiently, reducing the time complexity of the pattern matching process.

Preprocessing for Good Suffix Heuristic

As part of the preprocessing, an array shift is created. Each entry shift[i] stores the distance the pattern will shift when a mismatch occurs at position i-1. Essentially, shift[i] tells us how far we should shift the pattern when a mismatch happens after matching the suffix starting at position i in the pattern P.

1) Preprocessing for Strong Good Suffix

Before we dive into the preprocessing steps, itโ€™s important to understand the concept of borders. A border is a substring that is both a proper suffix and a proper prefix. For example, in the string "ccacc", "c" and "cc" are borders because they appear at both ends of the string, but "cca" is not a border.

During preprocessing, we calculate the array bpos (border positions). Each entry bpos[i] stores the starting index of the border for the suffix starting at position i in the given pattern P.

  • The suffix starting at position m has no border, so bpos[m] = m + 1 (where m is the length of the pattern).

The shift position is determined by these borders, and the pattern is shifted based on these positions. The following code shows the preprocessing algorithm for the Strong Good Suffix Heuristic:

Explanation of the Code

Initialization:

  • i = m and j = m + 1 set up the initial boundary for calculating the borders.
  • bpos[i] = j initializes the bpos array, marking the end of the pattern.

Main Loop:

  • The main loop starts from the end of the pattern (i = m), and compares the character at pat[i-1] with pat[j-1].
  • If the characters match, the loop proceeds with calculating the borders.
  • If a mismatch is found, we adjust j using bpos[j] and try to find a new border.

Shifting the Pattern:

  • If shift[j] == 0, we calculate the shift based on the current mismatch and store it in shift[j] = j - i.

Example for bpos[i] = j:

Consider the pattern P = "ABBABAB" where m = 7. Hereโ€™s how the preprocessing works:

  • The suffix "AB" starting at position i = 5 has no border, so bpos[5] = 7.
  • The suffix "BABAB" starting at position i = 2 has a border "BAB", so bpos[2] = 4.

This helps in determining where to shift the pattern when a mismatch occurs.

2) Preprocessing for Case 2 (Widest Border of the Pattern)

In this preprocessing step, we determine the widest border of the pattern that is contained in each suffix. The widest border refers to the longest substring that serves as both a proper prefix and a proper suffix of the pattern.

  • bpos[0] stores the starting position of the widest border of the entire pattern.
  • This value is initially set in all free entries of the shift array. If the suffix length becomes smaller than bpos[0], the algorithm proceeds to the next-wider border.

This preprocessing ensures that when a mismatch occurs, the pattern is shifted optimally, either by aligning borders or skipping over parts of the text that are already matched.

Following is the implementation of the search algorithm -


Output
pattern occurs at shift = 0
pattern occurs at shift = 5
Comment
Article Tags: