I am working on a python project where I study RNA structure evolution (represented as a string for example: "(((...)))" where the parenthesis represent basepairs). The point being is that I have an ideal structure and a population that evolves towards the ideal structure. I have implemented everything however I would like to add a feature where I can get the "number of buckets" ie the k most representative structures in the population at each generation.
I was thinking of using the k-means algorithm but I am not sure how to use it with strings. I found scipy.cluster.vq but I don't know how to use it in my case.
thanks!
One problem you would face if using scipy.cluster.vq.kmeans
is that that function uses Euclidean distance to measure closeness. To shoe-horn your problem into one solveable by k-means
clustering, you'd have to find a way to convert your strings into numerical vectors and be able to justify using Euclidean distance as a reasonable measure of closeness.
That seems... difficult. Perhaps you are looking for Levenshtein distance instead?
Note there are variants of the K-means algorithm that can work with non-Euclideance distance metrics (such as Levenshtein distance). K-medoids
(aka PAM), for instance, can be applied to data with an arbitrary distance metric.
For example, using Pycluster
's implementation of k-medoids
, and nltk
's implementation of Levenshtein distance,
import nltk.metrics.distance as distance
import Pycluster as PC
words = ['apple', 'Doppler', 'applaud', 'append', 'barker',
'baker', 'bismark', 'park', 'stake', 'steak', 'teak', 'sleek']
dist = [distance.edit_distance(words[i], words[j])
for i in range(1, len(words))
for j in range(0, i)]
labels, error, nfound = PC.kmedoids(dist, nclusters=3)
cluster = dict()
for word, label in zip(words, labels):
cluster.setdefault(label, []).append(word)
for label, grp in cluster.items():
print(grp)
yields a result like
['apple', 'Doppler', 'applaud', 'append']
['stake', 'steak', 'teak', 'sleek']
['barker', 'baker', 'bismark', 'park']