I am doing a BULK INSERT into sqlserver and it is not inserting UTF-8 characters into database properly. The data file contains these characters, but the database rows contain garbage characters after bulk insert execution.
My first suspect was the last line of the format file:
10.0
3
1 SQLCHAR 0 0 "{|}" 1 INSTANCEID ""
2 SQLCHAR 0 0 "{|}" 2 PROPERTYID ""
3 SQLCHAR 0 0 "[|]" 3 CONTENTTEXT "SQL_Latin1_General_CP1_CI_AS"
But, after reading this official page it seems to me that this is actually a bug in reading the data file by the insert operation in SQL Server version 2008. We are using version 2008 R2.
What is the solution to this problem or at least a workaround?
I came here before looking for a solution for bulk inserting special characters. Didn't like the workaround with UTF-16 (that would double the size of csv file). I found out that you definitely CAN and it's very easy, you don't need a format file. This answer is for other people who are looking for the same, since it doesn't seem to be documented well anywhere, and I believe this is a very common issue for non-english speaking people. The solution is: just add CODEPAGE='65001' inside the with statement of the bulk insert. (65001=codepage number for UTF-8). Might not work for all unicode characters as suggested by Michael O, but at least it works perfect for latin-extended, greek and cyrillic, probably many others too.
Note: MSDN documentation says utf-8 is not supported, don't believe it, for me this works perfect in SQL server 2008, didn't try other versions however.
e.g.:
BULK INSERT #myTempTable
FROM 'D:\somefolder\myCSV.txt'+
WITH
(
CODEPAGE = '65001',
FIELDTERMINATOR = '|',
ROWTERMINATOR ='\n'
);
If all your special characters are in 160-255 (iso-8859-1 or windows-1252), you could also use:
BULK INSERT #myTempTable
FROM 'D:\somefolder\myCSV.txt'+
WITH
(
CODEPAGE = 'ACP',
FIELDTERMINATOR = '|',
ROWTERMINATOR ='\n'
);