Finding all repeated substrings in a string and how often they appear

alvitawa picture alvitawa · May 28, 2016 · Viewed 8k times · Source

Problem:

I need all the sequences of characters that meet the following:

  1. Sequence of characters must be present more than once ((LE, 1) is thus invalid).
  2. Sequence of characters must be longer than one character ((M, 2) is thus invalid).
  3. Sequence of characters must not be part of a longer existing sequence that is present the same number of times ((LI, 2) is thus invalid if (LIO, 2) is present).

So, if the input string was: KAKAMNENENELIOLELIONEM$ The output would be:

(KA, 2)
(NE, 4)
(LIO, 2)

It also needs to be fast, it should be able to solve a 1000 character long string in a reasonable amount of time.

What I have tried:

Getting amount of branches from suffix tree:

Editing this suffix tree -creating librabry(Python-Suffix-Tree), I made a program that gives somewhat erroneus results.

I added this function to the SuffixTree class in suffix_tree.py:

def get_repeated_substrings(self):
    curr_index = self.N
    values = self.edges.values()
    values = sorted(values, key=lambda x: x.dest_node_index)
    data = []  # index = edge.dest_node_index - 1
    for edge in values:
        if edge.source_node_index == -1:
            continue
        top = min(curr_index, edge.last_char_index)
        data.append([edge.source_node_index,
                self.string[edge.first_char_index:top+1]])
    repeated_substrings = {}
    source_node_indexes = [i[0] for i in data]
    nodes_pointed_to = set(source_node_indexes)
    nodes_pointed_to.remove(0)
    for node_pointed_to in nodes_pointed_to:
        presence = source_node_indexes.count(node_pointed_to)
        key = data[node_pointed_to-1][1]
        if key not in repeated_substrings:
            repeated_substrings[key] = 0
        repeated_substrings[key] += presence
    for key in repeated_substrings:
        if len(key) > 1:
            print(key, repeated_substrings[key])

And then used it and the rest of the library like this:

from lib.suffix_tree import SuffixTree

st = SuffixTree("KAKANENENELIOLELIONE$")
print(st.get_repeated_substrings())

Output:

KA 2
NE 7
LIO 2
IO 4

get_repeated_substrings() basically goes through all connections between nodes (called edges in this library) and saves how many more connections the node it points towards has (it saves it to repeated_substrings), and then it prints the saved values that are more than one character long.

It appends the number of connections to the amount it already had for that sequence, which works most of the time, but as you can see in the output above, it resulted in an incorrect value for 'NE' (7, it should have been 4). After solving this problem, I realised that this method won't detect patterns made out of the same character (AA, BB), and other malfunctions. My conclusion: Either there is no way to solve it with suffix trees, or I am doing something very wrong.

Other methods:

I have also tried some more straightforward ways, consisting of looping through stuff, but that hasn't been a success either:

import copy

string = 'kakabaliosie'

for main_char in set(string):
    indices = []
    for char_i, char in enumerate(string):
        if main_char == char:
            indices.append(char_i)
    relative = 1
    while True
        for index in indices:
            other_indices = copy.deepcopy(indices)
            other_indices.remove(index)
            for other_index in other_indices:

(can't manage to finish it)

Question:

How can I make a program that does what I want?

Answer

Rerito picture Rerito · Jun 23, 2016

Your suffix tree approach is the right way.

Getting the set of matches and their number of occurrences

Basically what you need to do is traverse the tree in a BFS fashion. Starting from the root's children you would recursively count the number of leaves reachable by each node. This would lead to a method for Node which you would call on the root. Here is a possible implementation:

def count_leaves(self, stree):
    leaves_count = 0
    for child in [stree.nodes[x.dest_node_index] for x in self.edges.values()]:
        child_leaves_count = child.count_leaves(stree)
        if 0 == child_leaves_count:
            # The child node is a leaf...
            leaves_count = leaves_count + 1
        else:
            # The child node is an internal node, we add up the number of leaves it can reach
            leaves_count = leaves_count + child_leaves_count
    self.leaves_count = leaves_count
    return leaves_count

Now, each node is labelled with the number of leaves it can reach.

Then, interesting properties of suffix tree will help you filter out automatically substrings that don't match some of your requirements:

  • If a string is only present once, then you necessarily end up in a transition to a leaf node (said transition has at least two characters because we don't count the end token $). This means it isn't an explicit state and thus we don't even consider it.
  • If we have (LI, 2) and (LIO, 2), then (LI, 2) is an implicit state of the suffix tree, which means it is in the middle of an edge. Since we only consider substrings with an explicit state (i.e. that end up in a node), we won't ever find (LI, 2) to begin with.
  • Internal nodes have at least 2 children, otherwise they would be a leaf and have none.

Now iterating over the internal nodes will get you both the list of substrings and their number of occurences in the input string (you need to filter out the nodes representing a 1 character substring).

You will find below a string representation of the suffix tree for your input string. This will help you visualize which substrings are matches.

- O - N E M $ - ##  
    - L E L I O N E M $ - ##  
- I O - N E M $ - ##  
      - L E L I O N E M $ - ##  
- $ - ##  
- E - M $ - ##  
      L I O - N E M $ - ##
              L E L I O N E M $ - ##
      N E - L I O L E L I O N E M $ - ##
            N E L I O L E L I O N E M $ - ##
- K A - M N E N E N E L I O L E L I O N E M $ - ##
        K A M N E N E N E L I O L E L I O N E M $ - ##
- L - E L I O N E M $ - ##
      I O - N E M $ - ##
            L E L I O N E M $ - ##
- A - M N E N E N E L I O L E L I O N E M $ - ##
      K A M N E N E N E L I O L E L I O N E M $ - ##
- M - $ - ##
      N E N E N E L I O L E L I O N E M $ - ##
- N E - M $ - ##
        L I O L E L I O N E M $ - ##
        N E - L I O L E L I O N E M $ - ##
              N E L I O L E L I O N E M $ - ##

This leads to the following output:

(IO, 2)
(ELIO, 2)
(ENE, 2)
(KA, 2)
(LIO, 2)
(NE, 4)
(NENE, 2)

Excluding redundant matches for a fixed number of occurrences (N)

We will now assume for instance that LIO, and IO should be filtered out because, just like ELIO, they have two matches. Such substrings of a larger match will be called "redundant matches". The following puzzle remains unsolved: given the set of all matches that occur exactly N times a.k.a N-matches (where N is a fixed integer), how can we filter out the "redundant" ones?

We start by making a priority queue from the set of N-matches ordered by decreasing length. We will then build iteratively a Generalized Suffix Tree (GST) of these matches to identify the redundant matches. To do so the algorithm is the following:

  1. For each element in the heap (taken at the top), test if this element is a substring of one of the elements already registered in the GST
    • If not: insert it into the GST and append it to the list of "good matches".
    • Else: Skip it since another larger match is already registered ... And try with the next element
  2. Once the heap is empty, the list of good matches contains all the non-redundant N-matches.

This leads to the following pseudo Python code:

match_heap = heapify(set_of_matches)
good_matches = []
match_gst = generalized_suffix_tree()
while (not match_heap.empty()):
    top_match = match_heap.top()
    if (not match_gst.is_substring(top_match.string)):
        gst_match.insert(top_match.string)
        good_matches.append(top_match)
    else:
        # The given match is a substring of an already registered, bigger match
        # We skip it
return good_matches

Filtering all the redundant matches

Now that we can filter the redundant matches for N-matches, it is easy to filter all of them out of our global set of matches. We gather matches in buckets using their number of occurrences, then we apply the algorithm of the previous section on each bucket.

Notes

To implement the algorithm above, you need to have a Generalized Suffix Tree implementation, this is a bit different from a regular suffix tree. If you cannot find a Python implementation, you could always adapt the one you got. See this question to get hints on how to do so.