SELECT FROM OPENROWSET( BULK...) changing special characters

Jonathan P. picture Jonathan P. · Jul 22, 2013 · Viewed 14k times · Source

I've been facing an issue for a few hours, and I can't seem to get my head around this one.

So I have a SQL Server database 2008R2, Collation SQL_Latin1_General_CP1_CI_AS. Inside there is a table, with a field named incoming_name. The collation of this field is also SQL_Latin1_General_CP1_CI_AS, and it is a NVARCHAR(255).

I have a .csv file with around 123000 rows. It's a basic csv, no double quotes around text, but no comma inside the fields, so when I run a manual import into my database it works fine. The incoming_name field contains all kind of text, but never longer than 255 characters. And in a few lines there are french accents (like 'Ch*â*teau d'Agassac').

Now I try to use the code

select 
    test_file.[INCOMING_NAME] COLLATE SQL_Latin1_General_CP1_CI_AS
    as [INCOMING_NAME]
    , test_file.[PRODUCT_CODE] AS [PRODUCT_CODE]
FROM
                OPENROWSET(
                BULK 'INSERT PATH OF THE .CSV HERE',
                FORMATFILE = 'INSERT PATH OF THE FORMAT FILE HERE',
                FIRSTROW = 2
                ) AS test_file

With the format file

<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <RECORD>
        <FIELD ID="4" xsi:type="CharTerm" TERMINATOR=',' MAX_LENGTH="255" COLLATION="SQL_LATIN1_GENERAL_CP1_CI_AS" />
        <FIELD ID="29" xsi:type="CharTerm" TERMINATOR='\r\n' />
    </RECORD>
    <ROW>
        <COLUMN SOURCE="4" NAME="INCOMING_NAME" xsi:type="SQLNVARCHAR"/>
        <COLUMN SOURCE="29" NAME="PRODUCT_CODE" xsi:type="SQLNVARCHAR"/>    
    </ROW>

The import works fine, and I get all my data, with the right values in the right fields, except for the accents...

For example when I add where test_file.incoming_name like '%agassac%' at the end of my query, I get a result like 'Château d'Agassac' instead of the original data 'Château d'Agassac' in my database.

What I don't understand is that I feel like at every step of the process, I did pick an accent sensitive collation, with a unicode datatype (NVARCHAR), so I really don't understand why the import doesn't pick the accents.

Thanks for reading this long question,

John.

EDIT: Ok, it looks like the .csv file I want to import is encoded with utf-8, and SQL Server 2008 doesn't want to support utf-8 import. Now I have no idea what to do. Any idea welcome...

Answer

Sonam picture Sonam · Jul 22, 2013

I think adding widenative as DATAFILETYPE should resolve the issue. Please refer to this link for further details: http://msdn.microsoft.com/en-us/library/ms189941.aspx