Equivalent of Iconv.conv("UTF-8//IGNORE",...) in Ruby 1.9.X?

Jordan Feldstein picture Jordan Feldstein · Oct 24, 2011 · Viewed 9.8k times · Source

I'm reading data from a remote source, and occassionally get some characters in another encoding. They're not important.

I'd like to get get a "best guess" utf-8 string, and ignore the invalid data.

Main goal is to get a string I can use, and not run into errors such as:

  • Encoding::UndefinedConversionError: "\xFF" from ASCII-8BIT to UTF-8:
  • invalid byte sequence in utf-8

Answer

Jordan Feldstein picture Jordan Feldstein · Oct 24, 2011

I thought this was it:

string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")

will replace all knowns with '?'.

To ignore all unknowns, :replace => '':

string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "")

Edit:

I'm not sure this is reliable. I've gone into paranoid-mode, and have been using:

string.encode("UTF-8", ...).force_encoding('UTF-8')

Script seems to be running, ok now. But I'm pretty sure I'd gotten errors with this earlier.

Edit 2:

Even with this, I continue to get intermittant errors. Not every time, mind you. Just sometimes.