How to get the length of Japanese characters in Javascript?

mark uy · Jul 12, 2012 · Viewed 7.1k times

I have an ASP Classic page with SHIFT_JIS charset. The meta tag under the page's head section is like this:

<meta http-equiv="Content-Type" content="text/html; charset=shift_jis">

My page has a text box (txtName) that should only allow 200 characters. I have a Javascript function that validates the character length, which is called on the onclick() event of my Submit button.

if(document.frmPage.txtName.value.length > 200) {
  alert("You have exceeded the maximum length of 200.");
  return false;
}

The problem is, Javascript is not getting the correct length of Japanese characters encoded in SHIFT_JIS. For example, the character 测 has a SHIFT_JIS length of 8 characters, but Javascript is only recognizing it as one character, probably because of the Unicode encoding that Javascript uses by default. Some characters like ケ have 2 or 3 characters when in SHIFT_JIS.

If I rely only on the length provided by Javascript, long Japanese strings would pass the page validation, and the page would then try to save them to the database, which fails because of the 200 maximum length of the DB column.

The browser that I'm using is Internet Explorer. Is there a way to get the SHIFT_JIS length of the Japanese character using Javascript? Is it possible to convert from Unicode to SHIFT_JIS using Javascript? How?

Thanks for the help!

Answer

bobince · Jul 13, 2012

"For example, the character 测 has a SHIFT_JIS length of 8 characters, but Javascript is only recognizing it as one character, probably because of the Unicode encoding"

Let's be clear: 测, U+6D4B (Han Character 'measure, estimate, conjecture') is a single character. When you encode it to a particular encoding like Shift-JIS, it may very well become multiple bytes.
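To make the character-versus-bytes distinction concrete, here is a minimal sketch of what JavaScript itself reports for this character. JavaScript strings count UTF-16 code units, so the result is independent of the page's charset; the variable names are only illustrative.

```javascript
// '\u6D4B' is the character 测, written as an escape so the source file's
// own encoding cannot affect the result.
var ch = '\u6D4B';

// length counts UTF-16 code units, not bytes in any particular encoding.
var lengthInCodeUnits = ch.length;

// The code unit value, formatted as uppercase hexadecimal.
var codePointHex = ch.charCodeAt(0).toString(16).toUpperCase();

console.log(lengthInCodeUnits);    // 1
console.log('U+' + codePointHex);  // U+6D4B
```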

In general JavaScript doesn't make encoding tables available so you can't find out how many bytes a character will take up. If you really need to, you have to carry around enough data to work it out yourself. For example, if you assume that the input contains only characters that are valid in Shift-JIS, this function would work out how many bytes are needed by keeping a list of all the characters that are a single byte, and assuming every other character takes two bytes:

function getShiftJISByteLength(s) {
    // Single-byte in Shift-JIS: ASCII-range characters and halfwidth katakana (U+FF61-U+FF9F).
    // Every other character is assumed to take two bytes.
    return s.replace(/[^\x00-\x80\uFF61-\uFF9F]/g, 'xx').length;
}
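As a rough sketch of how such a byte count could feed the 200-limit check from the question, the snippet below wraps the same idea in two helpers. The names countShiftJISBytes and fitsShiftJISLimit are hypothetical, and the count is only meaningful under the same assumption as above: the input contains nothing but characters that exist in Shift-JIS.

```javascript
// Hypothetical helper names; assumes the input contains only characters
// that exist in Shift-JIS.
function countShiftJISBytes(s) {
    // ASCII-range characters and halfwidth katakana (U+FF61-U+FF9F) count as
    // one byte each; everything else is assumed to be a two-byte character.
    return s.replace(/[^\x00-\x80\uFF61-\uFF9F]/g, 'xx').length;
}

// True when the estimated Shift-JIS byte length fits within maxBytes.
function fitsShiftJISLimit(value, maxBytes) {
    return countShiftJISBytes(value) <= maxBytes;
}

console.log(countShiftJISBytes('abc'));     // 3
console.log(countShiftJISBytes('\uFF79'));  // 1 (halfwidth katakana KE)
console.log(countShiftJISBytes('\u30B1'));  // 2 (fullwidth katakana KE)

// 101 hiragana characters are estimated at 202 bytes, so this prints false.
console.log(fitsShiftJISLimit(new Array(102).join('\u3042'), 200));
```

In the question's page, the validation condition would then become something like !fitsShiftJISLimit(document.frmPage.txtName.value, 200) instead of comparing value.length directly.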

However, there are no 8-byte sequences in Shift-JIS, and the character 测 is not available in Shift-JIS at all. (It's a Chinese character not used in Japan.)

Why you might be thinking it constitutes an 8-byte sequence is this: when a browser can't submit a character in a form, because it does not exist in the target charset, it replaces it with an HTML character reference: in this case &#27979;. This is a lossy mangling: you can't tell whether the user typed 测 or literally typed the eight characters &#27979;. And if you are displaying the submitted content &#27979; as 测, then that means you are forgetting to HTML-encode your output, which probably means your application is highly vulnerable to cross-site scripting.
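The following sketch illustrates both points: where the figure of 8 comes from, and what HTML-encoding the submitted value looks like. Both function names are hypothetical, codePointAt assumes a reasonably modern JavaScript engine, and in a Classic ASP application the escaping would normally happen server-side (for example with Server.HTMLEncode) rather than in script like this.

```javascript
// Mimics the substitution a browser makes for a character that the form's
// charset cannot encode: a decimal numeric character reference.
function toNumericCharacterReference(ch) {
    return '&#' + ch.codePointAt(0) + ';';
}

// Minimal HTML-escaping helper for text placed in element content or in
// quoted attribute values. The ampersand must be replaced first.
function escapeHtml(text) {
    return String(text)
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        .replace(/"/g, '&quot;')
        .replace(/'/g, '&#39;');
}

var submitted = toNumericCharacterReference('\u6D4B'); // the character 测

console.log(submitted);              // &#27979;
console.log(submitted.length);       // 8
console.log(escapeHtml(submitted));  // &amp;#27979;
```

Once the output is escaped, the reference is shown to the user as the literal text &#27979; instead of being rendered as 测, which is the behaviour a correctly encoding application would have.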

The only sensible answer is to use UTF-8 instead of Shift-JIS. UTF-8 can happily encode 测, or any other character, without having to resort to broken HTML character references. If you need to store content limited by encoded byte length in your database, there is a sneaky hack you can use to get the number of UTF-8 bytes in a string:

function getUTF8ByteLength(s) {
    return unescape(encodeURIComponent(s)).length;
}

although it would probably be better to store native Unicode strings in the database, so that the length limit refers to actual characters and not bytes in some encoding.
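The unescape function used in that hack is deprecated. Where the TextEncoder API is available (current browsers and Node.js, but not Internet Explorer), the same count can be obtained more directly. This is a sketch under that assumption, with hypothetical function names, shown next to the original trick for comparison.

```javascript
// Counts UTF-8 bytes by actually encoding the string.
// Assumes a global TextEncoder, which Internet Explorer does not provide.
function utf8ByteLength(s) {
    return new TextEncoder().encode(s).length;
}

// The unescape/encodeURIComponent trick, kept for comparison.
function utf8ByteLengthLegacy(s) {
    return unescape(encodeURIComponent(s)).length;
}

console.log(utf8ByteLength('abc'));           // 3
console.log(utf8ByteLength('\u6D4B'));        // 3 (the character 测)
console.log(utf8ByteLength('\uD83D\uDE00'));  // 4 (a character outside the BMP)
console.log(utf8ByteLengthLegacy('\u6D4B'));  // 3
```

For well-formed strings the two functions agree. They differ on lone surrogates: encodeURIComponent throws a URIError, while TextEncoder substitutes U+FFFD.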