I'm building a private spellchecker for the Microsoft Office suite. I'm doing string comparisons of typos and their potential fixes to determine which corrections I want included.
I've looked high and low for a weighted Damerau-Levenshtein formula for string comparison because I want swaps, insertions, deletions and replacements to all have different weights, not simply a weight of "1", so I can give preference to some corrections over others. For example, the typo "agmes" could theoretically correct to "games" or "ages", since both require just one edit to move to either correctly spelled word, but I'd like to give the "swap" edit a lower weight so that "games" would show as the preferred correction.
I'm using Excel for analysis, so any code I use needs to be in Visual Basic for Applications (VBA). The best I could find is this example, which seems great, but it's in Java. I tried my best to convert, but I'm far from an expert and could use a little help!
Can anyone take a look at the attached code and help me figure out what's wrong?
THANK YOU!
EDIT: I got it working on my own. Here's a weighted Damerau-Levenshtein formula in VBA. It uses Excel's built-in math functions for some evaluation. When comparing a typo to two possible corrections, the correction with the highest cost is the preferred word. This is because the cost of two swaps must be greater than the cost of a deletion and an insertion, and that's not possible if you assign swaps with the lowest cost (which I think is ideal). Check out Kevin's blog if you need more info.
Public Function WeightedDL(source As String, target As String) As Double
Dim deleteCost As Double
Dim insertCost As Double
Dim replaceCost As Double
Dim swapCost As Double
deleteCost = 1
insertCost = 1.1
replaceCost = 1.1
swapCost = 1.2
Dim i As Integer
Dim j As Integer
Dim k As Integer
If Len(source) = 0 Then
WeightedDL = Len(target) * insertCost
Exit Function
End If
If Len(target) = 0 Then
WeightedDL = Len(source) * deleteCost
Exit Function
End If
Dim table() As Double
ReDim table(Len(source), Len(target))
Dim sourceIndexByCharacter() As Variant
ReDim sourceIndexByCharacter(0 To 1, 0 To Len(source) - 1) As Variant
If Left(source, 1) <> Left(target, 1) Then
table(0, 0) = Application.Min(replaceCost, (deleteCost + insertCost))
End If
sourceIndexByCharacter(0, 0) = Left(source, 1)
sourceIndexByCharacter(1, 0) = 0
Dim deleteDistance As Double
Dim insertDistance As Double
Dim matchDistance As Double
For i = 1 To Len(source) - 1
deleteDistance = table(i - 1, 0) + deleteCost
insertDistance = ((i + 1) * deleteCost) + insertCost
If Mid(source, i + 1, 1) = Left(target, 1) Then
matchDistance = (i * deleteCost) + 0
Else
matchDistance = (i * deleteCost) + replaceCost
End If
table(i, 0) = Application.Min(Application.Min(deleteDistance, insertDistance), matchDistance)
Next
For j = 1 To Len(target) - 1
deleteDistance = table(0, j - 1) + insertCost
insertDistance = ((j + 1) * insertCost) + deleteCost
If Left(source, 1) = Mid(target, j + 1, 1) Then
matchDistance = (j * insertCost) + 0
Else
matchDistance = (j * insertCost) + replaceCost
End If
table(0, j) = Application.Min(Application.Min(deleteDistance, insertDistance), matchDistance)
Next
For i = 1 To Len(source) - 1
Dim maxSourceLetterMatchIndex As Integer
If Mid(source, i + 1, 1) = Left(target, 1) Then
maxSourceLetterMatchIndex = 0
Else
maxSourceLetterMatchIndex = -1
End If
For j = 1 To Len(target) - 1
Dim candidateSwapIndex As Integer
candidateSwapIndex = -1
For k = 0 To UBound(sourceIndexByCharacter, 2)
If sourceIndexByCharacter(0, k) = Mid(target, j + 1, 1) Then candidateSwapIndex = sourceIndexByCharacter(1, k)
Next
Dim jSwap As Integer
jSwap = maxSourceLetterMatchIndex
deleteDistance = table(i - 1, j) + deleteCost
insertDistance = table(i, j - 1) + insertCost
matchDistance = table(i - 1, j - 1)
If Mid(source, i + 1, 1) <> Mid(target, j + 1, 1) Then
matchDistance = matchDistance + replaceCost
Else
maxSourceLetterMatchIndex = j
End If
Dim swapDistance As Double
If candidateSwapIndex <> -1 And jSwap <> -1 Then
Dim iSwap As Integer
iSwap = candidateSwapIndex
Dim preSwapCost
If iSwap = 0 And jSwap = 0 Then
preSwapCost = 0
Else
preSwapCost = table(Application.Max(0, iSwap - 1), Application.Max(0, jSwap - 1))
End If
swapDistance = preSwapCost + ((i - iSwap - 1) * deleteCost) + ((j - jSwap - 1) * insertCost) + swapCost
Else
swapDistance = 500
End If
table(i, j) = Application.Min(Application.Min(Application.Min(deleteDistance, insertDistance), matchDistance), swapDistance)
Next
sourceIndexByCharacter(0, i) = Mid(source, i + 1, 1)
sourceIndexByCharacter(1, i) = i
Next
WeightedDL = table(Len(source) - 1, Len(target) - 1)
End Function
I can see you've answered this yourself: I wrote a modified Levenshtein edit distance algorithm for address matching a couple of years ago (the site's now hosted in Russia and it's a bad idea to go there) but that didn't perform at all well, and a 'sum of common strings' approach was adequate for the task in hand:
Fuzzy-Matching strings in Excel using a simplified 'Edit Distance' proxy in VBA
That code probably needs re-testing and re-work.
Looking at your code, if you ever want to revisit it, here's a speed tip:
Dim arrByte() As Byte Dim byteChar As BytearrByte = strSource
for i = LBound(arrByte) To UBound(arrByte) Step 2 byteChar = arrByte(i) ' I'll do some comparison operations using integer arithmetic on the char Next i
String-handling in VBA is horribly slow, even if you use Mid$() instead of Mid(), but numeric operations are pretty good: and strings are actually arrays of bytes, which the compiler will accept at face value.
The 'step' of 2 in the loop is to skip over the high-order bytes in unicode strings - you're probably running your string comparison on plain-vanilla ASCII text, and you'll see that the byte array for (say) "ABCd" is (00, 65, 00, 66, 00, 67, 00, 100). Most of the Latin alphabet in Western European countries - accents, diacritics, dipthongs and all - will fit in under 255 and won't venture into the higer-order bytes that show as zeroes in that wxample.
You'll get away with it in strictly monolingual string comparisons in Hebrew, Greek, Russian and Arabic because the upper byte is constant within each alphabet: Greek "αβγδ" is the byte array (177,3,178,3,179,3,180,3). However, that's sloppy coding and it'll bite (or byte) you the moment you try string comparisons across languages. And it's never going to fly in Eastern alphabets.