Using SQLServer contains for partial words

Guy Korland picture Guy Korland · Jun 7, 2016 · Viewed 9.3k times · Source

We are running many products search on a huge catalog with partially matched barcodes.

We started with a simple like query

select * from products where barcode like '%2345%'

But that takes way too long since it requires a full table scan. We thought a fulltext search will be able to help us here using contains.

select * from products where contains(barcode, '2345')

But, it seems like contains doesn't support finding words that partially contains a text but, only full a word match or a prefix. (But in this example we're looking for '123456').

Answer

MtwStark picture MtwStark · Jul 12, 2016

My answer is: @DenisReznik was right :)

ok, let's take a look.
I have worked with barcodes and big catalogs for many years and I was curious about this question.

So I have made some tests on my own.

I have created a table to store test data:

CREATE TABLE [like_test](
    [N] [int] NOT NULL PRIMARY KEY,
    [barcode] [varchar](40) NULL
) 

I know that there are many types of barcodes, some contains only numbers, other contains also letters, and other can be even much complex.

Let's assume our barcode is a random string.
I have filled it with 10 millions records of random alfanumeric data:

insert into like_test
select (select count(*) from like_test)+n, REPLACE(convert(varchar(40), NEWID()), '-', '') barcode 
from FN_NUMBERS(10000000)

FN_NUMBERS() is just a function I use in my DBs (a sort of tally_table) to get records quick.

I got 10 million records like that:

N   barcode
1   1C333262C2D74E11B688281636FAF0FB
2   3680E11436FC4CBA826E684C0E96E365
3   7763D29BD09F48C58232C7D33551E6C9

Let's declare a var to search for:

declare @s varchar(20) = 'D34F15' -- random alfanumeric string 

Let's take a base try with LIKE to compare results to:

select * from like_test where barcode like '%'+@s+'%'

On my workstation it takes 24.4 secs for a full clustered index scan. Very slow.

SSMS suggests to add an index on barcode column:

CREATE NONCLUSTERED INDEX [ix_barcode] ON [like_test] ([barcode]) INCLUDE ([N])

500Mb of index, I retry the select, this time 24.0 secs for the non clustered index seek.. less than 2% better, almost the same result. Very far from the 75% supposed by SSMS. It seems to me this index really doesn't worth it. Maybe my SSD Samsung 840 is making the difference..
For the moment I let the index active.

Let's try the CHARINDEX solution:

select * from like_test where charindex(@s, barcode) > 0

This time it took 23.5 second to complete, not really so much better than LIKE.

Now let's check the @DenisReznik 's suggestion that using the Binary Collation should speed up things.

select * from like_test
where barcode collate Latin1_General_BIN like '%'+@s+'%' collate Latin1_General_BIN 

WOW, it seems to work! Only 4.5 secs this is impressive! 5 times better..
So, what about CHARINDEX and Collation toghether? Let's try it:

select * from like_test
where charindex(@s collate Latin1_General_BIN, barcode collate Latin1_General_BIN)>0

Unbelivable! 2.4 secs, 10 times better..

Ok, so far I have realized that CHARINDEX is better than LIKE, and that Binary Collation is better than normal string collation, so from now on I will go on only with CHARINDEX and Collation.

Now, can we do anything else to get even better results? Maybe we can try reduce our very long strings.. a scan is always a scan..

First try, a logical string cut using SUBSTRING to virtually works on barcodes of 8 chars:

select * from like_test
where charindex(
        @s collate Latin1_General_BIN, 
        SUBSTRING(barcode, 12, 8) collate Latin1_General_BIN
      )>0

Fantastic! 1.8 seconds.. I have tried both SUBSTRING(barcode, 1, 8) (head of the string) and SUBSTRING(barcode, 12, 8) (middle of the string) with same results.

Then I have tried to phisically reduce the size of the barcode column, almost no difference than using SUBSTRING()

Finally I have tried to drop the index on barcode column and repeated ALL above tests... I was very surprised to get almost same results, with very little differences.
Index performs 3-5% better, but at cost of 500Mb of disk space and and maintenance cost if the catalog is updated.

Naturally, for a direct key lookup like where barcode = @s with the index it takes 20-50 millisecs, without index we can't get less than 1.1 secs using Collation syntax where barcode collate Latin1_General_BIN = @s collate Latin1_General_BIN

This was interesting.
I hope this helps