Given a set of n strings arr[], find the shortest string that contains each string in the given set as a substring. We may assume that no string in arr[] is a substring of another string in arr[].
Examples:
Input: arr[] = {"geeks", "quiz", "for"}
Output: geeksquizfor
Explanation: "geeksquizfor" contains all three strings of arr[]:
- "geeksquizfor" contains "geeks".
- "geeksquizfor" contains "quiz".
- "geeksquizfor" contains "for".
Input: arr[] = {"catg", "ctaagt", "gcta", "ttca", "atgcatc"}
Output: gctaagttcatgcatc
Explanation:
- "gctaagttcatgcatc" contains "catg".
- "gctaagttcatgcatc" contains "ctaagt".
- "gctaagttcatgcatc" contains "gcta".
- "gctaagttcatgcatc" contains "ttca".
- "gctaagttcatgcatc" contains "atgcatc".
Shortest Superstring Greedy Approximate Algorithm
The Shortest Superstring Problem is NP-hard. Every known solution that always finds the shortest superstring takes exponential time. Below is an approximate greedy algorithm.
Let arr[] be given set of strings.
1) Create an auxiliary array of strings, temp[]. Copy the contents of arr[] to temp[].
2) While temp[] contains more than one string:
a) Find the most overlapping string pair in temp[]. Let this pair be 'a' and 'b'.
b) Replace 'a' and 'b' with the string obtained after combining them.
3) The only string left in temp[] is the result; return it.
Two strings overlap if a prefix of one string is the same as a suffix of the other, or vice versa. The maximum overlap is the length of the longest such matching prefix and suffix.
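The overlap definition can be sketched in Python as follows. This is a minimal illustration, not the article's original listing; the helper names `overlap_len` and `best_merge` are assumptions introduced here.

```python
def overlap_len(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def best_merge(a: str, b: str):
    """Try both orders (a before b, b before a) and keep the larger overlap.

    Returns (overlap, merged_string).
    """
    ab = overlap_len(a, b)  # suffix of a matches prefix of b
    ba = overlap_len(b, a)  # suffix of b matches prefix of a
    if ab >= ba:
        return ab, a + b[ab:]
    return ba, b + a[ba:]


print(best_merge("catgc", "atgcatc"))  # (4, 'catgcatc')
print(best_merge("ctaagt", "gcta"))    # (3, 'gctaagt')
```

The check is deliberately naive (it tries every candidate length), which keeps it easy to verify by hand.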
Working of above Algorithm:
arr[] = {"catgc", "ctaagt", "gcta", "ttca", "atgcatc"}
Initialize:
temp[] = {"catgc", "ctaagt", "gcta", "ttca", "atgcatc"}
The most overlapping strings are "catgc" and "atgcatc"
(Suffix of length 4 of "catgc" is same as prefix of "atgcatc")
Replace two strings with "catgcatc", we get
temp[] = {"catgcatc", "ctaagt", "gcta", "ttca"}
The most overlapping strings are "ctaagt" and "gcta"
(Prefix of length 3 of "ctaagt" is same as suffix of "gcta")
Replace two strings with "gctaagt", we get
temp[] = {"catgcatc", "gctaagt", "ttca"}
The most overlapping strings are "catgcatc" and "ttca"
(Prefix of length 2 of "catgcatc" is same as suffix of "ttca")
Replace two strings with "ttcatgcatc", we get
temp[] = {"ttcatgcatc", "gctaagt"}
Now there are only two strings in temp[]. After combining
the two in the optimal way, we get temp[] = {"gctaagttcatgcatc"}
Since temp[] has only one string now, return it.
Below is the implementation of the above algorithm.
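A minimal Python sketch of the greedy procedure described above. The function names `overlap` and `greedy_superstring` are illustrative assumptions, and ties between equally overlapping pairs are broken by taking the first pair found, which the algorithm description does not specify.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(arr):
    temp = list(arr)  # step 1: work on a copy of the input
    while len(temp) > 1:  # step 2: merge until one string remains
        best = (-1, 0, 1)  # (overlap, index of left string, index of right string)
        for i in range(len(temp)):
            for j in range(len(temp)):
                if i != j:
                    k = overlap(temp[i], temp[j])
                    if k > best[0]:  # first pair wins on ties (assumption)
                        best = (k, i, j)
        k, i, j = best
        merged = temp[i] + temp[j][k:]
        # step 2b: replace the chosen pair with the merged string
        temp = [s for idx, s in enumerate(temp) if idx not in (i, j)]
        temp.append(merged)
    return temp[0] if temp else ""  # step 3


arr = ["catgc", "ctaagt", "gcta", "ttca", "atgcatc"]
print("The Shortest Superstring is", greedy_superstring(arr))
```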
The Shortest Superstring is gctaagttcatgcatc
The time complexity of this algorithm is O(n^3 * m), where n is the number of strings in the input array and m is the maximum length of any string in the array, assuming each call to findOverlappingPair takes O(m) time. This is because the main loop runs n-1 times, and each iteration calls findOverlappingPair for O(n^2) pairs.
The space complexity is O(n * m), which is the space required to store the input array and the result string.
Performance of above algorithm:
The above greedy algorithm is proved to be 4-approximate (i.e., the length of the superstring it generates is never more than 4 times the length of the shortest possible superstring). It is conjectured to be 2-approximate (nobody has found a case where it generates a superstring more than twice as long as the optimal one). The conjectured worst-case example is {ab^k, b^k c, b^(k+1)}. For example, for {"abb", "bbc", "bbb"} (k = 2), the above algorithm may generate "abbcbbb" (if "abb" and "bbc" are picked as the first pair), but the actual shortest superstring is "abbbc". Here the ratio is 7/5, but for large k the ratio approaches 2, since the greedy result has length 2k + 3 while the optimal one has length k + 3.
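The worst-case family can be checked numerically with a short Python sketch. The helpers `merge` and `family` are hypothetical names introduced for this illustration; the "bad" order simulates the greedy algorithm breaking the initial tie unfavourably.

```python
def merge(a: str, b: str) -> str:
    """Append b to a, reusing the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return a + b[k:]
    return a + b


def family(k: int):
    """Return the three strings a b^k, b^k c, b^(k+1)."""
    return "a" + "b" * k, "b" * k + "c", "b" * (k + 1)


for k in (2, 10, 100):
    x, y, z = family(k)
    bad = merge(merge(x, y), z)   # greedy merges a b^k with b^k c first
    best = merge(merge(x, z), y)  # optimal order: a b^k, b^(k+1), b^k c
    print(k, len(bad), len(best), round(len(bad) / len(best), 3))
```

For k = 2 this reproduces the lengths 7 and 5 from the example, and the ratio moves toward 2 as k grows.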
Another Approach:
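The greedy approach merges, at each step, the two strings with the maximum overlap, removes them from the string array, and puts the merged string back into the array. An exact alternative treats each string as a node of a directed graph, where the weight of the edge from string i to string j is the number of characters that must be appended to string i so that the result ends with string j.
The problem then becomes: find the shortest path in this graph that visits every node exactly once. This is a variant of the Travelling Salesman Problem.
Apply the Travelling Salesman Problem DP solution over subsets of strings. Remember to record the path so that the superstring can be reconstructed.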
Below is the implementation of the above approach:
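A minimal Python sketch of the subset DP, assuming (as stated at the top) that no input string is a substring of another. Instead of minimising path length directly, it maximises the total overlap along the path, which is equivalent. The names `shortest_superstring`, `ov`, `dp`, and `parent` are illustrative assumptions.

```python
def shortest_superstring(arr):
    n = len(arr)
    if n == 0:
        return ""
    # ov[i][j]: longest suffix of arr[i] that is a prefix of arr[j]
    ov = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                for k in range(min(len(arr[i]), len(arr[j])), 0, -1):
                    if arr[i].endswith(arr[j][:k]):
                        ov[i][j] = k
                        break

    # dp[mask][i]: best total overlap using exactly the strings in mask,
    # with arr[i] placed last; parent[mask][i] records the previous string.
    dp = [[-1] * n for _ in range(1 << n)]
    parent = [[-1] * n for _ in range(1 << n)]
    for i in range(n):
        dp[1 << i][i] = 0
    for mask in range(1, 1 << n):
        for last in range(n):
            if dp[mask][last] < 0:  # unreachable state
                continue
            for nxt in range(n):
                if mask & (1 << nxt):
                    continue
                new_mask = mask | (1 << nxt)
                cand = dp[mask][last] + ov[last][nxt]
                if cand > dp[new_mask][nxt]:
                    dp[new_mask][nxt] = cand
                    parent[new_mask][nxt] = last

    # Walk the recorded path backwards from the best final string.
    mask = (1 << n) - 1
    last = max(range(n), key=lambda i: dp[mask][i])
    order = []
    while last != -1:
        order.append(last)
        prev = parent[mask][last]
        mask ^= 1 << last
        last = prev
    order.reverse()

    # Glue the strings together, skipping the overlapping prefix each time.
    result = arr[order[0]]
    for a, b in zip(order, order[1:]):
        result += arr[b][ov[a][b]:]
    return result


arr = ["catgc", "ctaagt", "gcta", "ttca", "atgcatc"]
print("The Shortest Superstring is", shortest_superstring(arr))
```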
The Shortest Superstring is gctaagttcatgcatc
Time complexity: O(n^2 * 2^n), where n is the number of strings in the array.
Auxiliary Space: O(2^n * n).
There exist better approximation algorithms for this problem. Please refer to the link below.
Shortest Superstring Problem | Set 2 (Using Set Cover)
This is actually a bitmasking problem: if we treat the strings as nodes, we can define the distance from one string to another. For example, for abcde and cdefghij the distance is 5, because we need 5 more symbols (fghij) to extend the first string so that it ends with the second. Note that this distance is not symmetric, so the graph is directed.
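The distance between two nodes can be sketched in Python as below. The function name `append_cost` is an assumption made for this illustration; the second pair of calls uses strings from the earlier example to show that the distance depends on direction.

```python
def append_cost(a: str, b: str) -> int:
    """Number of characters of b that must be appended to a
    so that the result ends with b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return len(b) - k
    return len(b)  # no overlap: all of b must be appended


print(append_cost("abcde", "cdefghij"))  # 5 (append "fghij")
print(append_cost("ttca", "catgc"))      # 3 (append "tgc")
print(append_cost("catgc", "ttca"))      # 4 (no overlap in this direction)
```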
The Shortest Superstring is catgccatcagta
Time complexity: O(2^n * n^2 * M), where M is the length of the answer.
Auxiliary Space: O(2^n * n * M).