Pre-requisite: Suffix Array.
What are k-mers? The term k-mers typically refers to all the possible substrings of length k that are contained in a string. Counting all the k-mers in DNA/RNA sequencing reads is the preliminary step of many bioinformatics applications.
What is a Suffix Array? A suffix array is an array of the starting positions of all suffixes of a string, sorted in lexicographic order of those suffixes. It is a data structure used in, among other things, full-text indices and data compression algorithms.
Problem: We are given a string str and an integer k. We have to find all pairs (substr, i) such that substr is a length-k substring of str that occurs exactly i times.
Steps involved in the approach:
Let's take the word "banana$" with k = 2 as an example.
Step 1: Compute the suffix array of the given text.
Index  Suffix
6      $
5      a$
3      ana$
1      anana$
0      banana$
4      na$
2      nana$
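Step 1 can be sketched in C as follows. This is a minimal illustration, not the article's original listing: it assumes a naive construction in which every suffix start position is sorted with qsort and suffixes are compared with strcmp. The names build_suffix_array, cmp_suffix and g_text are hypothetical, chosen only for this example.

```c
#include <stdlib.h>
#include <string.h>

/* Text being sorted. Kept at file scope because qsort's comparator
   receives no user-data argument (an assumption of this sketch). */
static const char *g_text;

/* Order two suffix start positions by the suffixes they denote. */
static int cmp_suffix(const void *a, const void *b)
{
    int i = *(const int *)a;
    int j = *(const int *)b;
    return strcmp(g_text + i, g_text + j);
}

/* Fill sa[0..n-1] with the start positions of the suffixes of text,
   in lexicographic order. sa must have room for strlen(text) ints.
   Returns n = strlen(text). */
int build_suffix_array(const char *text, int *sa)
{
    int n = (int)strlen(text);
    for (int i = 0; i < n; i++)
        sa[i] = i;
    g_text = text;
    qsort(sa, (size_t)n, sizeof sa[0], cmp_suffix);
    return n;
}
```

For "banana$" this fills sa with 6 5 3 1 0 4 2, matching the table above. Because '$' sorts before every lowercase letter in ASCII, the sentinel suffix comes first.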
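Step 2: Iterate through the suffix array keeping "curr_count", the number of consecutive suffixes seen so far that start with the same k-mer. Suffixes shorter than k (here "$") are skipped. Whenever the k-mer changes, output the previous k-mer with its count and reset curr_count to 1. In the table, curr_count is the value on reaching each row, before that row is processed; the last pair, (na, 2), is output once the scan ends.
Index  Suffix   curr_count  Valid Pair
6      $        1
5      a$       1
3      ana$     1           (a$, 1)
1      anana$   1
0      banana$  2           (an, 2)
4      na$      1           (ba, 1)
2      nana$    1           (na, 2)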
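The scan in Step 2 can be sketched in C as below. This is a hedged illustration rather than the article's original code: it assumes k >= 1 and that the suffix array sa has already been computed for text. The function names count_kmers and print_kmers, and the output arrays starts and counts, are hypothetical. Instead of a single curr_count variable, the running count of the current k-mer is kept in counts[distinct - 1].

```c
#include <stdio.h>
#include <string.h>

/* Scan the suffix array sa of text (n = strlen(text)) and collect every
   distinct k-mer. For the j-th distinct k-mer, starts[j] is the offset
   of one occurrence in text and counts[j] is how often it occurs.
   starts and counts must each have room for n entries.
   Returns the number of distinct k-mers. */
int count_kmers(const char *text, const int *sa, int n, int k,
                int *starts, int *counts)
{
    int distinct = 0;
    for (int i = 0; i < n; i++) {
        if (n - sa[i] < k)
            continue; /* suffix too short to contain a k-mer */
        if (distinct > 0 &&
            strncmp(text + starts[distinct - 1], text + sa[i], (size_t)k) == 0) {
            counts[distinct - 1]++; /* same k-mer as the previous suffix */
        } else {
            starts[distinct] = sa[i]; /* a new k-mer begins here */
            counts[distinct] = 1;
            distinct++;
        }
    }
    return distinct;
}

/* Print the collected pairs in the (k-mer, count) format used above. */
void print_kmers(const char *text, int k,
                 const int *starts, const int *counts, int m)
{
    for (int j = 0; j < m; j++)
        printf("(%.*s, %d)\n", k, text + starts[j], counts[j]);
}
```

Skipping short suffixes is safe because a suffix shorter than k can never sort between two suffixes that share the same first k characters, so equal k-mers always stay adjacent in the suffix array.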
Examples:
Input  : banana$, k = 2    // Input text
Output : (a$, 1)           // k-mers
         (an, 2)
         (ba, 1)
         (na, 2)
Input  : geeksforgeeks, k = 2
Output : (ee, 2)
         (ek, 2)
         (fo, 1)
         (ge, 2)
         (ks, 2)
         (or, 1)
         (rg, 1)
         (sf, 1)
A C implementation of the approach explained above, run on the text banana$ with k = 2, prints the following.
Output:
Input Text: banana$
k-mers:
(a$, 1)
(an, 2)
(ba, 1)
(na, 2)
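Time Complexity: O(s*len_text*log(len_text)), where s is the length of the longest suffix. This assumes the suffix array is built by comparison-sorting the suffixes: the sort makes O(len_text*log(len_text)) comparisons, and each comparison examines up to s characters. Since the longest suffix is the text itself, s = len_text, so the worst case is O(len_text^2*log(len_text)). The scan in Step 2 adds only O(k*len_text).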