How does clustering (especially String clustering) work?

string cluster-analysis data-mining

Renato Dinhani · Nov 19, 2011 · Viewed 25.6k times · Source

I heard about clustering to group similar data. I want to know how it works in the specific case for String.

I have a table with more than different 100,000 words.

I want to identify the same word with some differences (eg.: house, house!!, hooouse, HoUse, @house, "house", etc...).

What is needed to identify the similarity and group each word in a cluster? What algorithm is more recommended for this?

Answer

To understand what clustering is imagine a geographical map. You can see many distinct objects (such as houses). Some of them are close to each other, and others are far. Based on this, you can split all objects into groups (such as cities). Clustering algorithms make exactly this thing - they allow you to split your data into groups without previous specifying groups borders.

All clustering algorithms are based on the distance (or likelihood) between 2 objects. On geographical map it is normal distance between 2 houses, in multidimensional space it may be Euclidean distance (in fact, distance between 2 houses on the map also is Euclidean distance). For string comparison you have to use something different. 2 good choices here are Hamming and Levenshtein distance. In your particular case Levenshtein distance if more preferable (Hamming distance works only with the strings of same size).

Now you can use one of existing clustering algorithms. There's plenty of them, but not all can fit your needs. For example, pure k-means, already mentioned here will hardly help you since it requires initial number of groups to find, and with large dictionary of strings it may be 100, 200, 500, 10000 - you just don't know the number. So other algorithms may be more appropriate.

One of them is expectation maximization algorithm. Its advantage is that it can find number of clusters automatically. However, in practice often it gives less precise results than other algorithms, so it is normal to use k-means on top of EM, that is, first find number of clusters and their centers with EM and then use k-means to adjust the result.

Another possible branch of algorithms, that may be suitable for your task, is hierarchical clustering. The result of cluster analysis in this case in not a set of independent groups, but rather tree (hierarchy), where several smaller clusters are grouped into one bigger, and all clusters are finally part of one big cluster. In your case it means that all words are similar to each other up to some degree.

How does clustering (especially String clustering) work?

Answer

Related questions