I found lots of links about fuzzy matching, comparing one string to another and seeing which gets the highest similarity score.
I have one very long string, which is a document, and a substring. The substring came from the original document, but has been converted several times, so weird artifacts might have been introduced, such as a space here, a dash there. The substring will match a section of the text in the original document 99% or more. I am not matching to see from which document this string is, I am trying to find the index in the document where the string starts.
If the string was identical because no random error was introduced, I would use document.index(substring)
, however this fails if there is even one character difference.
I thought the difference would be accounted for by removing all characters except a-z in both the string and the substring, compare, and then use the index I generated when compressing the string to translate the index in the compressed string to the index in the real document. This worked well where the difference was whitespace and punctuation, but as soon as one letter is different it failed.
The document is typically a few pages to a hundred pages, and the substring from a few sentences to a few pages.
You could try amatch. It's available as a ruby gem and, although I haven't worked with fuzzy logic for a long time, it looks to have what you need. The homepage for amatch is: http://flori.github.com/amatch/.
Just bored and messing around with the idea, a completely non-optimized and untested hack of a solution follows:
include 'amatch'
module FuzzyFinder
def scanner( input )
out = [] unless block_given?
pos = 0
input.scan(/(\w+)(\W*)/) do |word, white|
startpos = pos
pos = word.length + white.length
if block_given?
yield startpos, word
else
out << [startpos, word]
end
end
end
def find( text, doc )
index = scanner(doc)
sstr = text.gsub(/\W/,'')
levenshtein = Amatch::Levensthtein.new(sstr)
minlen = sstr.length
maxndx = index.length
possibles = []
minscore = minlen*2
index.each_with_index do |x, i|
spos = x[0]
str = x[1]
si = i
while (str.length < minlen)
i += 1
break unless i < maxndx
str += index[i][1]
end
str = str.slice(0,minlen) if (str.length > minlen)
score = levenshtein.search(str)
if score < minscore
possibles = [spos]
minscore = score
elsif score == minscore
possibles << spos
end
end
[minscore, possibles]
end
end
Obviously there are numerous improvements possible and probably necessary! A few off the top: