PowerBuilder: ImportFile of UTF-8 (Converting UTF-8 to ANSI)

Sid · Mar 11, 2014 · Viewed 7.7k times

My PowerBuilder version is 6.5. I cannot use a higher version, as this is what I am supporting.

My problem is that when I call dw_1.ImportFile(file), the first column of the first row contains a funny string like this:

ï»¿

I did not understand this until I opened the file, saved its contents to a new text file, and imported that new file instead. That worked flawlessly, without the funny string.

My conclusion is that this happens because the original file is UTF-8 (as shown in Notepad++) and the new file is ANSI. The file I am trying to import is delivered automatically by a third party, and my users don't want the extra job of re-saving it.

How do I force-convert these files to ANSI in PowerBuilder? If there is no way, I might have to do a command prompt conversion. Any ideas?

Answer

Seki · Mar 12, 2014

The weird ï»¿ characters are the (optional) UTF-8 BOM, which tells editors that the file is UTF-8 encoded (otherwise this can be hard to tell until a character above code 127 is encountered). You cannot just strip it, because if your file contains any character above 127 (accents or other special characters), you will still get garbage in the displayed data (for example: é -> Ã©, € -> â‚¬, ...): each special character turns into 2 to 4 garbage characters.
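To see where these garbage strings come from, the bytes can be inspected outside PowerBuilder. The sketch below is in Python rather than PowerScript, purely as an illustration, and it assumes the ANSI code page is Windows-1252; the helper name `as_ansi` is hypothetical.

```python
import codecs

def as_ansi(text: str) -> str:
    """Encode text as UTF-8, then misread those bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")

# The UTF-8 BOM is the three bytes EF BB BF; read as ANSI they become three characters.
bom_garbage = codecs.BOM_UTF8.decode("cp1252")

print(bom_garbage)   # what lands in the first cell of the DataWindow
print(as_ansi("é"))  # a 2-byte UTF-8 sequence shows up as 2 characters
print(as_ansi("€"))  # a 3-byte UTF-8 sequence shows up as 3 characters
```

This is why removing the first three bytes alone is not enough: every non-ASCII character in the file is affected in the same way.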

I recently needed to convert some UTF-8 encoded strings to "ANSI" Windows-1252 encoding. With PB 10 and later, re-encoding between UTF-8 and ANSI is as simple as

b = blob(s, encodingutf8!)
s2 = string(b, encodingansi!)

But string() and blob() do not support an encoding argument before PB 10.

What you can do is to read the file yourself, skip the BOM, ask Windows to convert the string encoding via MultiByteToWideChar() + WideCharToMultiByte() and load the converted string in the DW with ImportString().
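If the conversion ends up being done outside PowerBuilder (the "command prompt conversion" mentioned in the question), the same three steps can be sketched as follows. This is a minimal Python illustration, not the PowerScript solution; the function name `utf8_to_ansi_bytes`, the default code page `cp1252`, and the choice to replace unmappable characters with `?` are all assumptions.

```python
import codecs

def utf8_to_ansi_bytes(raw: bytes, codepage: str = "cp1252") -> bytes:
    """Drop a leading UTF-8 BOM if there is one, then re-encode as ANSI."""
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    text = raw.decode("utf-8")
    # Assumption: characters with no ANSI equivalent are replaced by '?'.
    return text.encode(codepage, errors="replace")

# A tab-separated sample line as the third party might deliver it.
sample = codecs.BOM_UTF8 + "café\t5 €\r\n".encode("utf-8")
converted = utf8_to_ansi_bytes(sample)
print(converted)
```

Unlike an unconditional 3-byte skip, this checks for the BOM first, so it also handles UTF-8 files that were saved without one.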

Proof of concept to get the file contents (with this reading method, the file cannot be bigger than 2GB):

string ls_path, ls_file, ls_chunk, ls_ansi
int li_file

ls_path = sle_path.text
if not fileexists(ls_path) then return

li_file = FileOpen(ls_path, streammode!)
if li_file > 0 then
    //skip the 3-byte utf-8 BOM (this assumes the file always starts with one)
    FileSeek(li_file, 3, FromBeginning!)

    //read the file by blocks, FileRead is limited to 32kB per call
    do while FileRead(li_file, ls_chunk) > 0
        ls_file += ls_chunk //concatenating in a loop works but is slow on big files
    loop

    FileClose(li_file)

    ls_ansi = utf8_to_ansi(ls_file)
    //note: older PB versions may not accept the import type argument;
    //if this does not compile, try dw_tab.importstring(ls_ansi)
    dw_tab.importstring( text!, ls_ansi)
end if

utf8_to_ansi() is a global function. It was written for PB9, but it should work the same with PB6.5:

global type utf8_to_ansi from function_object
end type

type prototypes
function ulong MultiByteToWideChar(ulong CodePage, ulong dwflags, ref string lpmultibytestr, ulong cchmultibyte, ref blob lpwidecharstr, ulong cchwidechar) library "kernel32.dll"
function ulong WideCharToMultiByte(ulong CodePage, ulong dwFlags, ref blob lpWideCharStr, ulong cchWideChar, ref string lpMultiByteStr, ulong cbMultiByte, ref string lpDefaultChar, ref boolean lpUsedDefaultChar) library "kernel32.dll"
end prototypes

forward prototypes
global function string utf8_to_ansi (string as_utf8)
end prototypes

global function string utf8_to_ansi (string as_utf8);

//convert utf-8 -> ansi
//use a wide-char native string as pivot

constant ulong CP_ACP = 0
constant ulong CP_UTF8 = 65001

string ls_wide, ls_ansi, ls_null
blob lbl_wide
ulong ul_len
boolean lb_flag

setnull(ls_null)
lb_flag = false

//get utf-8 string length converted as wide-char
setnull(lbl_wide)
ul_len = multibytetowidechar(CP_UTF8, 0, as_utf8, -1, lbl_wide, 0)
//allocate buffer to let windows write into
ls_wide = space(ul_len * 2)
lbl_wide = blob(ls_wide)
//convert utf-8 -> wide char
ul_len = multibytetowidechar(CP_UTF8, 0, as_utf8, -1, lbl_wide, ul_len)
//get the final ansi string length
setnull(ls_ansi)
ul_len = widechartomultibyte(CP_ACP, 0, lbl_wide, -1, ls_ansi, 0, ls_null, lb_flag)
//allocate buffer to let windows write into
ls_ansi = space(ul_len)
//convert wide-char -> ansi
ul_len = widechartomultibyte(CP_ACP, 0, lbl_wide, -1, ls_ansi, ul_len, ls_null, lb_flag)

return ls_ansi
end function