My Perl program takes some text from a disk file as input, wraps it in some XML, then outputs it to STDOUT. The input is nominally UTF-8, but sometimes has junk inserted. I need to sanitize the output such that no invalid UTF-8 octets are emitted, otherwise the downstream consumer (Sphinx) will blow up.
At the very least I would like to know if the data is invalid so I can avoid passing it on; ideally I could remove just the offending bytes. However, enabling all the fatalisms I can find doesn't quite get me there with perl 5.12 (FWIW, `use v5.12; use warnings qw( FATAL utf8 );` is in effect).
I'm specifically having trouble with the sequence `"\xEF\xBF\xBE"`. If I create a file containing only these three bytes (`perl -e 'print "\xEF\xBF\xBE"' > bad.txt`), trying to read the file with mode `:encoding(UTF-8)` errors out with `utf8 "\xFFFE" does not map to Unicode`, but only under 5.14.0. 5.12.3 and earlier are perfectly fine reading and later writing that sequence. I'm unsure where it's getting the `\xFFFE` (an illegal reverse-BOM) from, but at least having a complaint is consistent with Sphinx.
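A minimal reproduction of the failing read, given the setup above:

```perl
use v5.12;
use warnings qw( FATAL utf8 );

# bad.txt contains only the three bytes EF BF BE
open my $fh, '<:encoding(UTF-8)', 'bad.txt' or die $!;
my $data = <$fh>;    # dies under 5.14.0, reads silently under 5.12.3
close $fh;
```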
Unfortunately, `decode_utf8("\xEF\xBF\xBE", 1)` causes no errors under 5.12 or 5.14. I'd prefer a detection method that doesn't require an encoded I/O layer, as that just leaves me with an error message and no way to sanitize the raw octets.
I'm sure there are more sequences that I need to address, but just handling this one would be a start. So my questions are: can I reliably detect this kind of problem data with a perl before 5.14? What substitution routine can generally sanitize almost-UTF-8 into strict UTF-8?
You should read the UTF-8 vs. utf8 vs. UTF8 section of the Encode docs.
To summarize, Perl has two different UTF-8 encodings. Its native encoding is called `utf8`, and it basically allows any codepoint, regardless of what the Unicode standard says about that codepoint.
The other encoding is called `utf-8` (a.k.a. `utf-8-strict`). This allows only codepoints that are listed as legal for interchange by the Unicode standard.
"\xEF\xBF\xBE"
, when interpreted as UTF-8, decodes to the codepoint U+FFFE. But that's not legal for interchange according to Unicode, so programs that are strict about such things complain.
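You can verify what's in those bytes with the lax decoder, which never complains; a quick sketch:

```perl
use Encode qw(decode);

my $octets = "\xEF\xBF\xBE";
my $string = decode('utf8', $octets);   # lax decode; accepts anything
printf "U+%04X\n", ord $string;         # prints U+FFFE
```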
Instead of using `decode_utf8` (which uses the lax `utf8` encoding), use `decode` with the `utf-8` encoding. And read the Handling Malformed Data section to see the different ways you can handle or complain about problems.
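For example, a sketch of a detection check using `FB_CROAK` from that section (other CHECK values, like the default, substitute U+FFFD instead of dying):

```perl
use Encode qw(decode FB_CROAK);

my $octets = "\xEF\xBF\xBE";
my $string = eval { decode('utf-8', $octets, FB_CROAK) };
if (!defined $string) {
    warn "not strict UTF-8: $@";
    # fall back to a lax decode plus manual filtering here
}
```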
Update: It does appear that some versions of Perl don't complain about U+FFFE, even when using the `utf-8-strict` encoding. This appears to be a bug. You may just have to build a list of codepoints that Sphinx complains about and filter them out manually (e.g. with `tr`).
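A sketch of that manual filter; the extra noncharacter range is an assumption about what Sphinx might reject, since only U+FFFE is confirmed so far:

```perl
use Encode qw(decode encode);

my $octets = "junk \xEF\xBF\xBE more";   # raw bytes from the file
my $string = decode('utf8', $octets);    # lax decode always succeeds
$string =~ tr/\x{FFFE}\x{FFFF}//d;       # drop the offenders found so far
# extend the list as Sphinx reports more, e.g. other BMP noncharacters:
# $string =~ s/[\x{FDD0}-\x{FDEF}]//g;
print encode('utf-8', $string);          # re-encode the cleaned string
```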