I have a web site that receives a CSV file by FTP once a month. For years it was an ASCII file. Now I receive UTF-8 one month, UTF-16BE the next, and UTF-16LE the month after that. Maybe I'll get UTF-32 next month. fgets returns the byte order mark (BOM) at the beginning of the UTF files. How can I get PHP to recognize the character encoding automatically?
I tried mb_detect_encoding, and it returned ASCII regardless of the file type. I then changed my code to read the BOM and pass the character encoding explicitly to mb_convert_encoding. This worked until the latest file, which is UTF-16LE. For this file, the first line is read correctly and all subsequent lines show as question marks ("?"). What am I doing wrong?
$fhandle = fopen( $file_in, "r" );
if ( $fhandle === false )
{
echo "<p class=redbold>Error opening file $file_in.</p>";
die();
}
$i = 0;
while( ( $line = fgets( $fhandle ) ) !== false )
{
$i++;
// Detect encoding on first line. Actual text always begins with string "Document"
if ( $i == 1 )
{
$line_start = substr( $line, 0, 4 );
$line_start_hex = bin2hex( $line_start );
$utf16_start = 'fffe4400';
$utf8_start = 'efbbbf44';
if ( strcmp( $line_start, 'Docu' ) == 0 )
{ $char_encoding = 'ASCII'; }
elseif ( strcmp( $line_start_hex, 'efbbbf44' ) == 0 )
{
$char_encoding = 'UTF-8';
$line = substr( $line, 3 );
}
elseif ( strcmp( $line_start_hex, 'fffe4400' ) == 0 )
{
$char_encoding = 'UTF-16LE';
$line = substr( $line, 2 );
}
elseif ( strcmp( $line_start_hex, 'feff4400' ) == 0 )
{
$char_encoding = 'UTF-16BE';
$line = substr( $line, 2 );
}
else
{
echo "<p class=redbold>Error, unknown character encoding. Line =<br>", $line_start_hex, '</p>';
require( '../footer.php' );
die();
}
echo "<p>char_encoding = $char_encoding</p>";
}
// Convert UTF
if ( $char_encoding != 'ASCII' )
{
$line = mb_convert_encoding( $line, 'ASCII', $char_encoding);
}
echo '<p>'; var_dump( $line ); echo '</p>';
}
Output:
char_encoding = UTF-16LE
string(101) "DocumentNumber,RecordedTS,Title,PageCount,City,TransTaxAccountCode,TotalTransferTax,Description,Name
"
string(83) "???????????????????????????????????????????????????????????????????????????????????"
string(88) "????????????????????????????????????????????????????????????????????????????????????????"
string(84) "????????????????????????????????????????????????????????????????????????????????????"
string(80) "????????????????????????????????????????????????????????????????????????????????"
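Explicitly pass the candidate encodings, in order, to mb_detect_encoding and enable its strict parameter. Also, please read the file with file_get_contents: if the file is in UTF-16LE, fgets will screw it up for you. fgets splits on the single byte 0x0A, but a UTF-16LE newline is the two bytes 0A 00, so every line after the first starts with the leftover 00 byte and all of its 2-byte code units are misaligned.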
<?php
header( "Content-Type: text/html; charset=utf-8");
$input = file_get_contents( $file_in );
$encoding = mb_detect_encoding( $input, array(
"UTF-8",
"UTF-32",
"UTF-32BE",
"UTF-32LE",
"UTF-16",
"UTF-16BE",
"UTF-16LE"
), TRUE );
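// mb_detect_encoding returns false when none of the candidates match.
if( $encoding === false ) {
die( "Unable to detect the character encoding of $file_in." );
}
if( $encoding !== "UTF-8" ) {
$input = mb_convert_encoding( $input, "UTF-8", $encoding );
}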
echo "<p>$encoding</p>";
// Split on any line ending; PHP_EOL only matches the server's own convention.
foreach( preg_split( '/\r\n|\r|\n/', $input ) as $line ) {
var_dump( $line );
}
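To see why fgets breaks on UTF-16LE, here is a minimal sketch that writes two lines of UTF-16LE text to an in-memory stream and reads them back the way the question's loop does. The sample text and the variable names are made up for illustration; it assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical sample: a BOM followed by two lines of UTF-16LE text.
$utf16le = "\xFF\xFE" . mb_convert_encoding("Document\nABC\n", 'UTF-16LE', 'UTF-8');

// Put the bytes in an in-memory stream so fgets can read them like a file.
$stream = fopen('php://memory', 'w+b');
fwrite($stream, $utf16le);
rewind($stream);

// fgets stops right after the first 0x0A byte it sees.
$first  = fgets($stream);
$second = fgets($stream);
fclose($stream);

// The first line ends with 0a but is missing the 00 that completes the newline.
echo bin2hex($first), "\n";
// The second line starts with that leftover 00, so its code units are shifted by one byte.
echo bin2hex($second), "\n";
```

Because the second chunk begins with the stray 00 byte, decoding it as UTF-16LE pairs the wrong bytes together, and each resulting character falls outside ASCII. That is consistent with the rows of question marks in the question's output.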
The order is important because UTF-8 and UTF-32 are more restrictive, while UTF-16 is extremely permissive: pretty much any even-length sequence of bytes is valid UTF-16.
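As a rough illustration of that permissiveness, the sketch below checks four arbitrary bytes against each candidate encoding with mb_check_encoding. The byte string is a made-up sample chosen so that it is not meaningful text in any encoding; it assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical sample: four arbitrary bytes, not real text.
$bytes = "\x80\x81\x82\x83";

$candidates = ['UTF-8', 'UTF-32BE', 'UTF-32LE', 'UTF-16BE', 'UTF-16LE'];

// Record whether the bytes are well-formed in each candidate encoding.
$valid = [];
foreach ($candidates as $encoding) {
    $valid[$encoding] = mb_check_encoding($bytes, $encoding);
}

var_dump($valid);
```

The bytes should be rejected as UTF-8 (they are continuation bytes with no lead byte) and as UTF-32 (the values exceed U+10FFFF), yet accepted as both UTF-16BE and UTF-16LE. That is why the UTF-16 variants belong at the end of the candidate list.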
The only way to retain all the information is to convert it to a Unicode encoding, not ASCII.
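A small sketch of what is lost when the target is ASCII: the field value below is a hypothetical example containing one non-ASCII character, and the code assumes mbstring's default substitute character has not been changed.

```php
<?php
// Hypothetical field value with one non-ASCII character (u with umlaut).
$utf8 = "Z\u{00FC}rich";
$utf16le = mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8');

// Converting to ASCII replaces anything ASCII cannot represent.
$asAscii = mb_convert_encoding($utf16le, 'ASCII', 'UTF-16LE');

// Converting to UTF-8 round-trips without loss.
$asUtf8 = mb_convert_encoding($utf16le, 'UTF-8', 'UTF-16LE');

var_dump($asAscii);
var_dump($asUtf8 === $utf8);
```

With the default settings, the ASCII result carries a substitute character in place of the umlaut, while the UTF-8 result is byte-for-byte identical to the original string.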