How can I detect common substrings in a list of strings

danio picture danio · Sep 11, 2009 · Viewed 24.9k times · Source

Given a set of strings, for example:

EFgreen
EFgrey
EntireS1
EntireS2
J27RedP1
J27GreenP1
J27RedP2
J27GreenP2
JournalP1Black
JournalP1Blue
JournalP1Green
JournalP1Red
JournalP2Black
JournalP2Blue
JournalP2Green

I want to be able to detect that these are three sets of files:

  • EntireS[1,2]
  • J27[Red,Green]P[1,2]
  • JournalP[1,2][Red,Green,Blue]

Are there any known ways of approaching this problem - any published papers I can read on this?

The approach I am considering is for each string look at all other strings and find the common characters and where differing characters are, trying to find sets of strings that have the most in common, but I fear that this is not very efficient and may give false positives.

Note that this is not the same as 'How do I detect groups of common strings in filenames' because that assumes that a string will always have a series of digits following it.

Answer

Jeremy Bourque picture Jeremy Bourque · Sep 11, 2009

I would start here: http://en.wikipedia.org/wiki/Longest_common_substring_problem

There are links to supplemental information in the external links, including Perl implementations of the two algorithms explained in the article.

Edited to add:

Based on the discussion, I still think Longest Common Substring could be at the heart of this problem. Even in the Journal example you reference in your comment, the defining characteristic of that set is the substring 'Journal'.

I would first consider what defines a set as separate from the other sets. That gives you your partition to divide up the data, and then the problem is in measuring how much commonality exists within a set. If the defining characteristic is a common substring, then Longest Common Substring would be a logical starting point.

To automate the process of set detection, in general, you will need a pairwise measure of commonality which you can use to measure the 'difference' between all possible pairs. Then you need an algorithm to compute the partition that results in the overall lowest total difference. If the difference measure is not Longest Common Substring, that's fine, but then you need to determine what it will be. Obviously it needs to be something concrete that you can measure.

Bear in mind also that the properties of your difference measurement will bear on the algorithms that can be used to make the partition. For example, assume diff(X,Y) gives the measure of difference between X and Y. Then it would probably be useful if your measure of distance was such that diff(A,C) <= diff(A,B) + diff(B,C). And obviously diff(A,C) should be the same as diff(C,A).

In thinking about this, I also begin to wonder whether we could conceive of the 'difference' as a distance between any two strings, and, with a rigorous definition of the distance, could we then attempt some kind of cluster analysis on the input strings. Just a thought.