How to get rid of non-ascii characters in ruby

Question 1

ruby unicode cgi ascii

Andre Garzia · Aug 12, 2009 · Viewed 52.2k times · Source

Answer

Use String#encode

The official way to convert between string encodings as of Ruby 1.9 is to use String#encode.

To simply remove non-ASCII characters, you could do this:

some_ascii   = "abc"
some_unicode = "áëëçüñżλφθΩ𠜎😸"
more_ascii   = "123ABC"
invalid_byte = "\255"  # Octal escape for byte 0xAD, which is not valid UTF-8 on its own

non_ascii_string = [some_ascii, some_unicode, more_ascii, invalid_byte].join

# See String#encode documentation
encoding_options = {
  :invalid           => :replace,  # Replace invalid byte sequences
  :undef             => :replace,  # Replace anything not defined in ASCII
  :replace           => '',        # Use a blank for those replacements
  :universal_newline => true       # Always break lines with \n
}

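# On Ruby 3.0+, pass the options as keyword arguments instead:
#   non_ascii_string.encode(Encoding.find('ASCII'), **encoding_options)
ascii = non_ascii_string.encode(Encoding.find('ASCII'), encoding_options)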
puts ascii.inspect
  # => "abce123ABC"

Notice that the first five characters of the result are "abce1": the "á" was discarded, one "ë" was discarded, but the other "ë" appears to have been converted to "e".

The reason for this is that there are sometimes multiple ways to express the same written character in Unicode. The "á" is a single Unicode codepoint. The first "ë" is, too. When Ruby sees these during this conversion, it discards them.

But the second "ë" is two codepoints: a plain "e", just like you'd find in an ASCII string, followed by a "combining diacritical mark" (U+0308 COMBINING DIAERESIS), which means "put an umlaut on the previous character". In the Unicode string, these are interpreted as a single "grapheme", or visible character. When converting this, Ruby keeps the plain ASCII "e" and discards the combining mark.
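The two spellings are easier to tell apart by their codepoints than by eye. Here is a minimal sketch that writes both forms with explicit `\u` escapes so an editor cannot silently normalize them. The variable names are illustrative. The last line assumes Ruby 2.2 or later, where `String#unicode_normalize` is available, and `**` needs Ruby 2.0 or later:

```ruby
# Two spellings of the same visible character
precomposed = "\u00EB"   # one codepoint: LATIN SMALL LETTER E WITH DIAERESIS
decomposed  = "e\u0308"  # plain "e" followed by COMBINING DIAERESIS

options = { :invalid => :replace, :undef => :replace, :replace => '' }

p precomposed.codepoints                   # => [235]
p decomposed.codepoints                    # => [101, 776]
p precomposed.encode('ASCII', **options)   # => ""
p decomposed.encode('ASCII', **options)    # => "e"

# Decomposing first (NFD) splits the accent off, so the base letter survives
p precomposed.unicode_normalize(:nfd).encode('ASCII', **options)  # => "e"
```

If you would rather keep base letters than drop accented characters entirely, normalizing to NFD before encoding may be enough and saves maintaining a replacement table.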

If you decide you'd like to provide some specific replacement values, you could do this:

REPLACEMENTS = {
  'á' => 'a',
  'ë' => 'e',
}

encoding_options = {
  :invalid           => :replace,  # Replace invalid byte sequences
  :replace           => '',        # Use a blank for those replacements
  :universal_newline => true,      # Always break lines with \n
  # For any character that isn't defined in ASCII, run this
  # code to find out how to replace it
  :fallback          => lambda { |char|
    # If no replacement is specified, use an empty string
    REPLACEMENTS.fetch(char, '')
  },
}

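# On Ruby 3.0+, pass the options as keyword arguments instead:
#   non_ascii_string.encode(Encoding.find('ASCII'), **encoding_options)
ascii = non_ascii_string.encode(Encoding.find('ASCII'), encoding_options)
puts ascii.inspect
  # => "abcaee123ABC"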
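The same `:fallback` approach can cover the smart quotes and ligatures that pasted text often carries. This is only a sketch: the `TYPOGRAPHIC_REPLACEMENTS` table and the `to_ascii` helper are hypothetical names, and the table lists just a few common characters, so extend it to whatever your users actually paste:

```ruby
# Hypothetical replacement table for a few typographic characters
TYPOGRAPHIC_REPLACEMENTS = {
  "\u2018" => "'",   # left single quotation mark
  "\u2019" => "'",   # right single quotation mark
  "\u201C" => '"',   # left double quotation mark
  "\u201D" => '"',   # right double quotation mark
  "\uFB01" => 'fi',  # "fi" ligature
  "\uFB02" => 'fl',  # "fl" ligature
}

# Hypothetical helper: convert to ASCII, substituting known characters
# and dropping anything else that ASCII cannot represent
def to_ascii(text)
  text.encode(
    'ASCII',
    :invalid  => :replace,
    :replace  => '',
    :fallback => lambda { |char| TYPOGRAPHIC_REPLACEMENTS.fetch(char, '') }
  )
end

caption = "\u201CIt\u2019s a \uFB01ne day\u201D"
puts to_ascii(caption)
# => "It's a fine day"
```

Because the options are written inline, this call works on both older and newer Rubies without changes.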

Update

Some have reported issues with the :universal_newline option. I have seen this intermittently, but haven't been able to track down the cause.

When it happens, I see Encoding::ConverterNotFoundError: code converter not found (universal_newline). However, after some RVM updates, I've just run the script above under the following Ruby versions without problems:

ruby-1.9.2-p290
ruby-1.9.3-p125
ruby-1.9.3-p194
ruby-1.9.3-p362
ruby-2.0.0-preview2
ruby-head (as of 12-31-2012)

Given this, it doesn't appear to be a deprecated feature or even a bug in Ruby. If anyone knows the cause, please comment.
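If you do hit that error, one possible workaround is to drop `:universal_newline` and convert the line endings by hand afterwards. This is a sketch of that idea rather than a fix for the underlying cause, and the `strip_non_ascii` name is hypothetical:

```ruby
# Hypothetical helper: remove non-ASCII characters without :universal_newline
def strip_non_ascii(text)
  ascii = text.encode('ASCII', :invalid => :replace, :undef => :replace, :replace => '')
  # Convert CRLF and lone CR line endings to LF manually
  ascii.gsub(/\r\n?/, "\n")
end

p strip_non_ascii("caf\u00E9\r\nmenu\rend\n")
# => "caf\nmenu\nend\n"
```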

Question 2

I have a Ruby CGI (not Rails) that picks up photos and captions from a web form. My users are very keen on smart quotes and ligatures, which they paste in from other sources. My web app does not deal well with these non-ASCII characters. Is there a quick Ruby string manipulation routine that can get rid of non-ASCII chars?
