Given two strings X and Y, find the Longest Common Substring of X and Y.
Naive [O(N*M^2)] and Dynamic Programming [O(N*M)] approaches are already discussed here.
In this article, we will discuss a linear-time approach to finding the LCS using a suffix tree (the 5th suffix tree application).
Here we will build a generalized suffix tree for the two strings X and Y, as already discussed at:
Generalized Suffix Tree 1
Let's take the same example (X = xabxa and Y = babxba) that we saw in Generalized Suffix Tree 1.
We built the following suffix tree for X and Y there:
This is the generalized suffix tree for xabxa#babxba$
In the above tree, leaves with suffix indices in [0,4] are suffixes of string xabxa, and leaves with suffix indices in [6,11] are suffixes of string babxba. Why?
Because in the concatenated string xabxa#babxba$, string xabxa starts at index 0 and its length is 5, so the indices of its suffixes are 0, 1, 2, 3 and 4. Similarly, string babxba starts at index 6 and its length is 6, so the indices of its suffixes are 6, 7, 8, 9, 10 and 11.
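The index arithmetic above can be written down as a small check. This is only a sketch: the function name owner_of_suffix is ours, and we assume the terminator positions (# and $) belong to neither string.

```python
def owner_of_suffix(index, len_x, len_y):
    """Return 'X' or 'Y' for a suffix index of X + '#' + Y + '$'.

    Returns None for the positions of the terminators '#' and '$'
    (an assumption of this sketch).
    """
    if 0 <= index < len_x:
        return "X"
    # Y starts right after the '#' that follows X
    if len_x + 1 <= index < len_x + 1 + len_y:
        return "Y"
    return None


X, Y = "xabxa", "babxba"
# One entry per position of the concatenated string xabxa#babxba$
owners = [owner_of_suffix(i, len(X), len(Y)) for i in range(len(X) + len(Y) + 2)]
```

For this example, indices 0 to 4 map to X, indices 6 to 11 map to Y, and indices 5 and 12 are the terminators.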
With this, we can see that in the generalized suffix tree figure above, some internal nodes have leaves below them from both strings X and Y, some from string X only, and some from string Y only.
The following figure shows the internal nodes marked as "XY", "X" or "Y", depending on which string the leaves below them belong to.
What do these "XY", "X" and "Y" markings mean?
The path label from the root to an internal node gives a substring of X or Y or both.
For a node marked as XY, the substring from the root to that node belongs to both strings X and Y.
For a node marked as X, the substring from the root to that node belongs to string X only.
For a node marked as Y, the substring from the root to that node belongs to string Y only.
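One way to compute these markings is bottom-up: a leaf is marked by the string its suffix belongs to, and an internal node takes the union of its children's marks. A minimal sketch of that combining step follows. The helper name combine_marks is ours and is not taken from any particular implementation.

```python
def combine_marks(child_marks):
    """Combine the marks of a node's children into the node's own mark.

    Each child mark is one of "X", "Y" or "XY". The result is "XY" as soon
    as leaves from both strings appear somewhere below the node.
    """
    seen = set()
    for mark in child_marks:
        seen.update(mark)  # "XY" contributes both 'X' and 'Y'
    return "".join(sorted(seen))


print(combine_marks(["X", "Y"]))   # prints XY
print(combine_marks(["X", "X"]))   # prints X
print(combine_marks(["XY", "Y"]))  # prints XY
```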
By looking at the above figure, can you see how to get the LCS of X and Y?
By now, it should be clear how to get at least a common substring of X and Y.
If we traverse the path from the root to any node marked as XY, we get a common substring of X and Y.
Now we need to find the longest one among all those common substrings.
Can you think of how to get the LCS now? Recall how we got the Longest Repeated Substring of a given string using a suffix tree.
The path label from the root to the deepest node marked as XY (deepest in terms of path label length) gives the LCS of X and Y. The deepest node is highlighted in the above figure, and the path label "abx" from the root to that node is the LCS of X and Y.
Output:
Longest Common Substring in xabxac and abcabxabcd is: abxa, of length: 4
Longest Common Substring in xabxaabxa and babxba is: abx, of length: 3
Longest Common Substring in GeeksforGeeks and GeeksQuiz is: Geeks, of length: 5
Longest Common Substring in OldSite:GeeksforGeeks.org and NewSite:GeeksQuiz.com is: Site:Geeks, of length: 10
Longest Common Substring in abcde and fghie is: e, of length: 1
Longest Common Substring in pqrst and uvwxyz is: No common substring
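As a rough illustration of the "deepest node marked as XY" rule, here is a minimal Python sketch. It is not the linear-time suffix tree construction this article relies on. To stay short, it builds an uncompressed suffix trie by inserting every suffix of X and Y separately, which takes quadratic time and space, and it marks nodes during insertion instead of using the # and $ terminators. The names new_node and longest_common_substring are ours.

```python
def new_node():
    """Create an empty trie node (hypothetical helper for this sketch)."""
    return {"children": {}, "marks": set()}


def longest_common_substring(x, y):
    """Return the longest common substring of x and y, or "" if there is none."""
    root = new_node()

    # Insert every suffix of each string, marking each node on the way
    # with the string ("X" or "Y") that the suffix came from.
    for text, mark in ((x, "X"), (y, "Y")):
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                node = node["children"].setdefault(ch, new_node())
                node["marks"].add(mark)

    # Iterative DFS that only descends into nodes marked with both strings.
    # In a trie every edge is one character, so depth equals label length.
    best = ""
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        for ch, child in node["children"].items():
            if len(child["marks"]) == 2:  # marked "XY"
                path = label + ch
                if len(path) > len(best):
                    best = path
                stack.append((child, path))
    return best


print(longest_common_substring("xabxa", "babxba"))  # prints abx
```

On the test pairs listed in the output above, this sketch returns the same substrings, with an empty string standing in for "No common substring".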
If the two strings are of sizes M and N, then the generalized suffix tree construction takes O(M+N) time, and finding the LCS is a DFS on the tree, which is again O(M+N).
So the overall complexity is linear in both time and space.