SQL Fuzzy Matching

Jeyara picture Jeyara · Nov 19, 2013 · Viewed 45k times · Source

Hope i am not repeating this question. I did some search here and google before posting here.

I am running a eStore with SQL Server 2008R2 with Full Text enabled.

My requirements,

  1. There is a Product Table, which has product name, OEM Codes, Model which this product fits into. All are in text.
  2. I have created a new column called TextSearch. This has concatenated values of Product Name, OEM Code and Model which this product fits in. These values are comma separated.
  3. When a customer enters a keyword, we run search on TextSearch column to match for products. See matching logic below.

I am using a Hybrid Fulltext and normal like to do search. This gives more relevant results. All the queries executed in to a temp table and distincts were returned.

Matching logic,

  1. Run following SQL to get relevant product using full text. But @Keywords will be pre-processed. Say 'CLC 2200' will be changed to 'CLC* AND 2200*'

    SELECT Id FROM dbo.Product WHERE CONTAINS (TextSearch ,@Keywords)

  2. Another query will be running using normal like. So 'CLC 2200' will be pre-processed to 'TextSearch like %clc% AND TextSearch like %2200%'. This is simply because full text search wont search patterns before the keywords. example, it wont return 'pclc 2200'.

    SELECT Id FROM dbo.Product WHERE TextSearch like '%clc%' AND TextSearch like '%2200%'

  3. If step 1 and 2 didn't return any records, following search will be executed. Value 135 was fine tuned by me to return more relevant records.

    SELECT p.id FROM dbo.Product AS p INNER JOIN FREETEXTTABLE(product,TextSearch,@Keywords) AS r ON p.Id = r.[KEY] WHERE r.RANK > 135

All of above combined works fine in a reasonable speed and returns relevant products for keywords.

But i am looking for to further improve when there is no product found at all.

Say if customer looks for 'CLC 2200npk' and this product wasn't there, i needed to show next very close by 'CLC 2200'.

So far i tried using Soundex() function. Buy computing soundex value for each word in TextSearch column and comparing with soudex value of keyword. But this returns way too many records and slow too.

example, 'CLC 2200npk' will return products such as 'CLC 1100' etc. But this wouldn't be a good result. As it is not close to CLC 2200npk

There is another good one here. but this uses CLR Functions. But i can not install CLR functions on the server.

So my logic would need to be,

if 'CLC 2200npk' not found, show close by 'CLC 2200' if 'CLC 2200' not found, show next close by 'CLC 1100'

Questions

  1. Is it possible to match like as suggested?
  2. If i would need to do spelling correction and search, what would be good way? All of our product listing is in English.
  3. Is there any UDF or SP's to match texts like my suggestions?

Thanks.

Answer

David Ewen picture David Ewen · Nov 19, 2013

A rather quick domain specific solution may be to calculate a string similarity using SOUNDEX and a numeric distance between 2 strings. This will only really help when you have a lot of product codes.

Using a simple UDF like below you can extract the numeric chars from a string so that you can then get 2200 out of 'CLC 2200npk' and 1100 out of 'CLC 1100' so you can now determine closeness based on the SOUNDEX output of each input as well as closeness of the numeric component of each input.

CREATE Function [dbo].[ExtractNumeric](@input VARCHAR(1000))
RETURNS INT
AS
BEGIN
    WHILE PATINDEX('%[^0-9]%', @input) > 0
    BEGIN
        SET @input = STUFF(@input, PATINDEX('%[^0-9]%', @input), 1, '')
    END
    IF @input = '' OR @input IS NULL
        SET @input = '0'
    RETURN CAST(@input AS INT)
END
GO

As far as general purpose algorithms go there are a couple which might help you with varying degrees of success depending on data set size and performance requirements. (both links have TSQL implementations available)

  • Double Metaphone - This algo will give you a better match than soundex at the cost of speed it is really good for spelling correction though.
  • Levenshtein Distance - This will calculate how many keypresses it would take to turn one string into another for instance to get from 'CLC 2200npk' to 'CLC 2200' is 3, while from 'CLC 2200npk' to 'CLC 1100' is 5.

Here is an interesting article which applies both algos together which may give you a few ideas.

Well hopefully some of that helps a little.

EDIT: Here is a much faster partial Levenshtein Distance implementation (read the post it wont return exact same results as the normal one). On my test table of 125000 rows it runs in 6 seconds compared to 60 seconds for the first one I linked to.