We are running many products search on a huge catalog with partially matched barcodes.
We started with a simple like query
select * from products where barcode like '%2345%'
But that takes way too long since it requires a full table scan. We thought a fulltext search will be able to help us here using contains.
select * from products where contains(barcode, '2345')
But, it seems like contains doesn't support finding words that partially contains a text but, only full a word match or a prefix. (But in this example we're looking for '123456').
My answer is: @DenisReznik was right :)
ok, let's take a look.
I have worked with barcodes and big catalogs for many years and I was curious about this question.
So I have made some tests on my own.
I have created a table to store test data:
CREATE TABLE [like_test](
[N] [int] NOT NULL PRIMARY KEY,
[barcode] [varchar](40) NULL
)
I know that there are many types of barcodes, some contains only numbers, other contains also letters, and other can be even much complex.
Let's assume our barcode is a random string.
I have filled it with 10 millions records of random alfanumeric data:
insert into like_test
select (select count(*) from like_test)+n, REPLACE(convert(varchar(40), NEWID()), '-', '') barcode
from FN_NUMBERS(10000000)
FN_NUMBERS() is just a function I use in my DBs (a sort of tally_table) to get records quick.
I got 10 million records like that:
N barcode
1 1C333262C2D74E11B688281636FAF0FB
2 3680E11436FC4CBA826E684C0E96E365
3 7763D29BD09F48C58232C7D33551E6C9
Let's declare a var to search for:
declare @s varchar(20) = 'D34F15' -- random alfanumeric string
Let's take a base try with LIKE to compare results to:
select * from like_test where barcode like '%'+@s+'%'
On my workstation it takes 24.4 secs for a full clustered index scan. Very slow.
SSMS suggests to add an index on barcode column:
CREATE NONCLUSTERED INDEX [ix_barcode] ON [like_test] ([barcode]) INCLUDE ([N])
500Mb of index, I retry the select, this time 24.0 secs for the non clustered index seek.. less than 2% better, almost the same result. Very far from the 75% supposed by SSMS. It seems to me this index really doesn't worth it. Maybe my SSD Samsung 840 is making the difference..
For the moment I let the index active.
Let's try the CHARINDEX solution:
select * from like_test where charindex(@s, barcode) > 0
This time it took 23.5 second to complete, not really so much better than LIKE.
Now let's check the @DenisReznik 's suggestion that using the Binary Collation should speed up things.
select * from like_test
where barcode collate Latin1_General_BIN like '%'+@s+'%' collate Latin1_General_BIN
WOW, it seems to work! Only 4.5 secs this is impressive! 5 times better..
So, what about CHARINDEX and Collation toghether? Let's try it:
select * from like_test
where charindex(@s collate Latin1_General_BIN, barcode collate Latin1_General_BIN)>0
Unbelivable! 2.4 secs, 10 times better..
Ok, so far I have realized that CHARINDEX is better than LIKE, and that Binary Collation is better than normal string collation, so from now on I will go on only with CHARINDEX and Collation.
Now, can we do anything else to get even better results? Maybe we can try reduce our very long strings.. a scan is always a scan..
First try, a logical string cut using SUBSTRING to virtually works on barcodes of 8 chars:
select * from like_test
where charindex(
@s collate Latin1_General_BIN,
SUBSTRING(barcode, 12, 8) collate Latin1_General_BIN
)>0
Fantastic! 1.8 seconds.. I have tried both SUBSTRING(barcode, 1, 8)
(head of the string) and SUBSTRING(barcode, 12, 8)
(middle of the string) with same results.
Then I have tried to phisically reduce the size of the barcode column, almost no difference than using SUBSTRING()
Finally I have tried to drop the index on barcode column and repeated ALL above tests...
I was very surprised to get almost same results, with very little differences.
Index performs 3-5% better, but at cost of 500Mb of disk space and and maintenance cost if the catalog is updated.
Naturally, for a direct key lookup like where barcode = @s
with the index it takes 20-50 millisecs, without index we can't get less than 1.1 secs using Collation syntax where barcode collate Latin1_General_BIN = @s collate Latin1_General_BIN
This was interesting.
I hope this helps