I need to find all the sequences of characters (longer than one character) that repeat in a string, together with how many times each one occurs.
So, if the input string was: KAKAMNENENELIOLELIONEM$
The output would be:
(KA, 2)
(NE, 4)
(LIO, 2)
It also needs to be fast: it should be able to handle a 1000-character string in a reasonable amount of time.
By editing this suffix-tree-creating library (Python-Suffix-Tree), I made a program that gives somewhat erroneous results.
I added this function to the SuffixTree class in suffix_tree.py:
def get_repeated_substrings(self):
    curr_index = self.N
    values = self.edges.values()
    values = sorted(values, key=lambda x: x.dest_node_index)
    data = []  # index = edge.dest_node_index - 1
    for edge in values:
        if edge.source_node_index == -1:
            continue
        top = min(curr_index, edge.last_char_index)
        data.append([edge.source_node_index,
                     self.string[edge.first_char_index:top+1]])
    repeated_substrings = {}
    source_node_indexes = [i[0] for i in data]
    nodes_pointed_to = set(source_node_indexes)
    nodes_pointed_to.remove(0)
    for node_pointed_to in nodes_pointed_to:
        presence = source_node_indexes.count(node_pointed_to)
        key = data[node_pointed_to-1][1]
        if key not in repeated_substrings:
            repeated_substrings[key] = 0
        repeated_substrings[key] += presence
    for key in repeated_substrings:
        if len(key) > 1:
            print(key, repeated_substrings[key])
And then used it and the rest of the library like this:
from lib.suffix_tree import SuffixTree

st = SuffixTree("KAKAMNENENELIOLELIONEM$")
st.get_repeated_substrings()
Output:
KA 2
NE 7
LIO 2
IO 4
get_repeated_substrings() basically goes through all the connections between nodes (called edges in this library) and, for each node an edge points to, records how many outgoing connections that node has (saving the result in repeated_substrings); it then prints the saved entries that are more than one character long.
It adds the number of connections to the count it already had for that sequence, which works most of the time, but as you can see in the output above it produced an incorrect value for 'NE' (7, when it should have been 4). After looking into this problem, I realised that this method also won't detect patterns made of a single repeated character (AA, BB), among other malfunctions. My conclusion: either there is no way to solve this with suffix trees, or I am doing something very wrong.
I have also tried some more straightforward approaches based on looping over the string's characters, but that hasn't been a success either:
import copy

string = 'kakabaliosie'
for main_char in set(string):
    indices = []
    for char_i, char in enumerate(string):
        if main_char == char:
            indices.append(char_i)
    relative = 1
    while True:
        for index in indices:
            other_indices = copy.deepcopy(indices)
            other_indices.remove(index)
            for other_index in other_indices:
                # (can't manage to finish it)
Your suffix tree approach is the right way.
Basically what you need to do is traverse the tree, starting from the root's children, and recursively count the number of leaves reachable from each node. This leads to a count_leaves method for Node, which you would call on the root. Here is a possible implementation:
def count_leaves(self, stree):
    leaves_count = 0
    for child in [stree.nodes[x.dest_node_index] for x in self.edges.values()]:
        child_leaves_count = child.count_leaves(stree)
        if 0 == child_leaves_count:
            # The child node is a leaf...
            leaves_count = leaves_count + 1
        else:
            # The child node is an internal node: add up the leaves it can reach
            leaves_count = leaves_count + child_leaves_count
    self.leaves_count = leaves_count
    return leaves_count
Now, each node is labelled with the number of leaves it can reach.
Then, interesting properties of the suffix tree automatically filter out substrings that don't meet some of your requirements: a substring that occurs only once ends in the middle of a leaf edge (the one carrying the terminator $). This means it isn't an explicit internal node, so we don't even consider it. Iterating over the internal nodes therefore gives you both the list of repeated substrings and their number of occurrences in the input string (you still need to filter out the nodes representing a 1-character substring).
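To make the idea concrete, here is a small self-contained sketch of my own (it does not use the Python-Suffix-Tree library): it builds a naive suffix trie out of nested dicts, counts the leaves below every node, and reports the branching nodes, which correspond exactly to the internal nodes of the compressed suffix tree.

```python
def build_suffix_trie(s):
    """Naive O(n^2) suffix trie: nested dicts, one level per character."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root

def repeated_substrings(s):
    """Return {substring: occurrences} for every branching trie node,
    i.e. for every internal node of the suffix tree of s."""
    root = build_suffix_trie(s + "$")
    found = {}

    def leaves(node, prefix):
        if not node:                      # empty dict: a leaf (end of a suffix)
            return 1
        total = sum(leaves(child, prefix + ch) for ch, child in node.items())
        if len(node) >= 2 and len(prefix) > 1:
            found[prefix] = total         # branching node: >= 2 occurrences
        return total

    leaves(root, "")
    return found

print(repeated_substrings("KAKAMNENENELIOLELIONEM"))
```

Note this is quadratic in both time and space, so unlike a real suffix tree it is only suitable for short strings, but it reproduces the list of internal-node substrings shown below.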
You will find below a string representation of the suffix tree for your input string. This will help you visualize which substrings are matches.
- O - N E M $ - ##
      L E L I O N E M $ - ##
- I O - N E M $ - ##
        L E L I O N E M $ - ##
- $ - ##
- E - M $ - ##
      L I O - N E M $ - ##
              L E L I O N E M $ - ##
      N E - L I O L E L I O N E M $ - ##
            N E L I O L E L I O N E M $ - ##
- K A - M N E N E N E L I O L E L I O N E M $ - ##
        K A M N E N E N E L I O L E L I O N E M $ - ##
- L - E L I O N E M $ - ##
      I O - N E M $ - ##
            L E L I O N E M $ - ##
- A - M N E N E N E L I O L E L I O N E M $ - ##
      K A M N E N E N E L I O L E L I O N E M $ - ##
- M - $ - ##
      N E N E N E L I O L E L I O N E M $ - ##
- N E - M $ - ##
        L I O L E L I O N E M $ - ##
        N E - L I O L E L I O N E M $ - ##
              N E L I O L E L I O N E M $ - ##
This leads to the following output:
(IO, 2)
(ELIO, 2)
(ENE, 2)
(KA, 2)
(LIO, 2)
(NE, 4)
(NENE, 2)
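These counts can be sanity-checked with a quick overlapping scan, independent of the suffix tree (a cross-check of my own, not part of the method above):

```python
def overlapping_count(s, sub):
    # str.count() skips overlaps (e.g. it finds "ENE" only once in "NENE"),
    # so slide a window over every starting position instead.
    return sum(1 for i in range(len(s) - len(sub) + 1) if s.startswith(sub, i))

s = "KAKAMNENENELIOLELIONEM"
for sub in ("IO", "ELIO", "ENE", "KA", "LIO", "NE", "NENE"):
    print(sub, overlapping_count(s, sub))
```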
We will now assume, for instance, that LIO and IO should be filtered out because, just like ELIO, they have two matches. Such substrings of a larger match will be called "redundant matches". The following puzzle remains to be solved: given the set of all matches that occur exactly N times, a.k.a. N-matches (where N is a fixed integer), how can we filter out the "redundant" ones?
We start by making a priority queue from the set of N-matches, ordered by decreasing length. We then iteratively build a Generalized Suffix Tree (GST) of these matches to identify the redundant ones. The algorithm is the following: pop the longest remaining match from the queue; if it is not yet a substring of the GST, it is a genuine match, so we insert it into the GST and keep it; otherwise it is a substring of an already kept, bigger match, so we skip it.
This leads to the following pseudo Python code:
match_heap = heapify(set_of_matches)  # priority queue, longest match first
good_matches = []
match_gst = generalized_suffix_tree()
while not match_heap.empty():
    top_match = match_heap.pop()
    if not match_gst.is_substring(top_match.string):
        match_gst.insert(top_match.string)
        good_matches.append(top_match)
    else:
        # The given match is a substring of an already registered,
        # bigger match: we skip it
        pass
return good_matches
Now that we can filter the redundant matches out of a set of N-matches, it is easy to filter all of them out of our global set of matches: gather the matches into buckets by number of occurrences, then apply the algorithm of the previous section to each bucket.
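Here is a runnable sketch of that bucketing step. For brevity it replaces the GST membership test with plain Python substring checks, which is slower but equivalent for this purpose (the GST only serves to answer is_substring quickly):

```python
from collections import defaultdict

def filter_redundant(matches):
    """matches: dict {substring: occurrence_count}.
    Keep only the matches that are not substrings of a longer match
    with the same occurrence count."""
    buckets = defaultdict(list)
    for sub, count in matches.items():
        buckets[count].append(sub)

    good = {}
    for count, bucket in buckets.items():
        kept = []
        # Longest first, so each shorter match is tested against every
        # longer match already kept in the same bucket.
        for sub in sorted(bucket, key=len, reverse=True):
            if not any(sub in bigger for bigger in kept):
                kept.append(sub)
        for sub in kept:
            good[sub] = count
    return good

matches = {"IO": 2, "ELIO": 2, "ENE": 2, "KA": 2,
           "LIO": 2, "NE": 4, "NENE": 2}
print(filter_redundant(matches))
```

On the match set above this keeps ELIO, NENE, KA and NE, dropping LIO, IO and ENE as redundant.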
To implement the algorithm above, you need a Generalized Suffix Tree implementation, which is slightly different from a regular suffix tree. If you cannot find a Python implementation, you can always adapt the one you already have. See this question for hints on how to do so.