Convert non-ASCII chars from ASCII-8BIT to UTF-8

craic.com · Feb 1, 2011

I'm pulling text from remote sites and trying to load it into a Ruby 1.9/Rails 3 app that uses utf-8 by default.

Here is an example of some offending text:

Cancer Res; 71(3); 1-11. ©2011 AACR.\n

With the copyright symbol expanded into its raw bytes, it looks like this:

Cancer Res; 71(3); 1-11. \xC2\xA92011 AACR.\n

Ruby tells me that the string is encoded as ASCII-8BIT, and feeding it into my Rails app gets me this:

incompatible character encodings: ASCII-8BIT and UTF-8

I can replace the copyright bytes using this regex:

str.gsub(/[^\x00-\x7F]/n,'?')

to produce this

Cancer Res; 71(3); 1-11. ??2011 AACR.\n

But how can I get a copyright symbol (and various other symbols, such as Greek letters) converted into the same symbols in UTF-8? Surely it is possible...

I see references to using force_encoding but this does not work:

str.force_encoding('utf-8').encode

I realize there are many other people with similar issues but I've yet to see a solution that works.

Answer

Phrogz · Feb 2, 2011

This works for me:

#encoding: ASCII-8BIT
str = "\xC2\xA92011 AACR"
p str, str.encoding
#=> "\xC2\xA92011 AACR"
#=> #<Encoding:ASCII-8BIT>

# force_encoding relabels the existing bytes in place; nothing is transcoded
str.force_encoding('UTF-8')
p str, str.encoding
#=> "©2011 AACR"
#=> #<Encoding:UTF-8>
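
Relabeling only helps when the bytes really are UTF-8, which may not hold for every remote site. As a rough sketch, a helper could check the result with valid_encoding? and fall back to transcoding otherwise. The name to_utf8 and the ISO-8859-1 fallback are assumptions for illustration, not part of the original answer; choose the fallback that matches your actual sources.

```ruby
# Hypothetical helper: returns a UTF-8 copy of a binary (ASCII-8BIT) string.
# Assumption: bytes that are not valid UTF-8 are treated as ISO-8859-1.
def to_utf8(bytes, fallback = Encoding::ISO_8859_1)
  # Work on a copy so the caller's string keeps its original encoding label.
  str = bytes.dup.force_encoding(Encoding::UTF_8)
  return str if str.valid_encoding?

  # Not valid UTF-8: label the bytes with the fallback encoding,
  # then transcode them to UTF-8 (this step does rewrite the bytes).
  bytes.dup.force_encoding(fallback).encode(Encoding::UTF_8)
end

# The sample text from the question, explicitly labeled as binary.
raw = "Cancer Res; 71(3); 1-11. \xC2\xA92011 AACR.\n".dup.force_encoding(Encoding::BINARY)

fixed = to_utf8(raw)
puts fixed.encoding        # UTF-8
puts fixed.valid_encoding? # true
```

The key distinction is that force_encoding changes only the label on the bytes, while encode converts the bytes from one encoding to another. Calling encode on a string that is already labeled UTF-8 with UTF-8 as the target has nothing to convert.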