Counting k-mers via Suffix Array

Last Updated : 24 Jul, 2024

Pre-requisite: Suffix Array.

What are k-mers? The term k-mers typically refers to all the possible substrings of length k that are contained in a string. Counting all the k-mers in DNA/RNA sequencing reads is a preliminary step in many bioinformatics applications.

What is a Suffix Array? A suffix array is a sorted array of all suffixes of a string. It is a data structure used, among other things, in full-text indices and data compression algorithms.

Problem: We are given a string str and an integer k. We have to find all pairs (substr, i) such that substr is a length-k substring of str that occurs exactly i times.

Steps involved in the approach:

Let's take the word "banana$" with k = 2 as an example.

Step 1: Compute the suffix array of the given text.

 6 $ 
 5 a$
 3 ana$
 1 anana$
 0 banana$
 4 na$ 
 2 nana$

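Step 2: Iterate through the suffix array while maintaining a counter "curr_count", the number of suffixes seen so far that start with the current k-mer.

1. If the length of the current suffix is less than k, skip the iteration. For example, with k = 2 the iteration is skipped when the current suffix is "$".
2. If the current suffix begins with the same length-k substring as the previous suffix, increment curr_count. For example, in the fourth iteration the current suffix "anana$" starts with the same length-2 substring "an" as the previous suffix "ana$", so curr_count is incremented.
3. If condition 2 is not satisfied, the run of the previous k-mer has ended: output the length-k prefix of the previous suffix together with curr_count as a valid pair, then reset curr_count to 1. Nothing is output if no suffix of length at least k has been seen yet.

Here "previous suffix" means the most recent suffix that was not skipped by condition 1, because a shorter suffix can sort between two longer ones and must not hide the k-mer that precedes it. After the last iteration, the k-mer of the final run is output with its curr_count in the same way.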

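 Index  Suffix    curr_count  Valid Pair
 6      $         1
 5      a$        1
 3      ana$      1           (a$, 1)
 1      anana$    1
 0      banana$   2           (an, 2)
 4      na$       1           (ba, 1)
 2      nana$     1           (na, 2)

curr_count is shown as it stands on entering each iteration. The last pair, (na, 2), is output once the final iteration has incremented curr_count to 2.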

Examples:

Input : banana$, k = 2 // Input text
Output : (a$, 1) // k-mers
 (an, 2)
 (ba, 1)
 (na, 2)

Input : geeksforgeeks, k = 2
Output : (ee, 2)
 (ek, 2)
 (fo, 1)
 (ge, 2)
 (ks, 2)
 (or, 1)
 (rg, 1)
 (sf, 1)

The following is the C code for the approach explained above:

Output:

Input Text: banana$ 
k-mers: 
(a$, 1)
(an, 2)
(ba, 1)
(na, 2)

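Time Complexity: O(s*len_text*log(len_text)), where s is the length of the longest suffix. This assumes the suffix array is built by comparison-sorting the suffixes: the sort performs O(len_text*log(len_text)) comparisons, and each comparison inspects at most s characters. The counting pass adds only O(k*len_text).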


URL: https://www.geeksforgeeks.org/dsa/counting-k-mers-via-suffix-array/
