how to render 32bit unicode characters in google v8 (and nodejs)

flow · Aug 8, 2011 · Viewed 7.7k times

does anyone have an idea how to render unicode 'astral plane' characters (whose code points are beyond 0xffff) in google v8, the javascript vm that drives both google chrome and nodejs?

funnily enough, when i give google chrome (it identifies as 11.0.696.71, running on ubuntu 10.4) an html page like this:

<script>document.write( "helo" )
document.write( "𡥂 ⿸𠂇子" );
</script>

it will correctly render the 'wide' character 𡥂 alongside the 'narrow' ones, but when i try the equivalent in nodejs (using console.log()) i get a single � (0xfffd, REPLACEMENT CHARACTER) for the 'wide' character instead.

i have also been told that for whatever non-understandable reason google have decided to implement characters using a 16bit-wide datatype. while i find that stupid, the surrogate codepoints have been designed precisely to enable the 'channeling' of 'astral codepoints' through 16bit-challenged pathways. and somehow the v8 running inside of chrome 11.0.696.71 seems to use this bit of unicode-foo or other magic to do its work (i seem to remember years ago i always got boxes instead even on static pages).

ah yes, node --version reports v0.4.10, gotta figure out how to obtain a v8 version number from that.
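for what it's worth, node reports the version of its embedded v8 through process.versions. a minimal sketch, assuming a node build that exposes that object (the variable name v8Version is just for illustration):

```javascript
// process.versions lists the versions of node and its bundled components;
// the `v8` entry is the version string of the embedded V8 engine.
const v8Version = process.versions.v8;

console.log('node ' + process.version + ' embeds v8 ' + v8Version);
```

the same thing works as a one-liner from the shell: node -e "console.log(process.versions.v8)".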

update i did the following in coffee-script:

a = String.fromCharCode( 0xd801 )
b = String.fromCharCode( 0xdc00 )
c = a + b
console.log a
console.log b
console.log c
console.log String.fromCharCode( 0xd835, 0xdc9c )

but that only gives me

���
���
������
������

the thinking behind this is that since that braindead part of the javascript specification that deals with unicode appears to mandate? / not downright forbid? / allow? the use of surrogate pairs, then maybe my source file encoding (utf-8) might be part of the problem. after all, there are two ways to write a codepoint beyond 0xffff as utf-8-style octets: one is to write out the octets needed for the first surrogate, then those for the second (strictly speaking that is not valid utf-8; the scheme is known as CESU-8); the other way (which is the one the utf-8 spec calls for) is to calculate the resulting codepoint and write out the octets needed for that codepoint. so here i completely exclude the question of source file encoding by dealing only with numbers. the above code does work with document.write() in chrome, giving 𐐀𝒜, so i know i got the numbers right.
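to make the two octet layouts concrete, here is a rough sketch for U+10400 (the 𐐀 from above). it assumes a current node where Buffer.from exists; encodeUnit is a hypothetical helper written for this illustration, not part of any library, and it only handles 16bit values from 0x800 upwards.

```javascript
// U+10400 as a surrogate pair, built from numbers only.
const text = String.fromCharCode(0xd801, 0xdc00);

// Regular UTF-8: the pair is combined into one code point and
// written out as four octets.
const utf8 = Buffer.from(text, 'utf8');
console.log(utf8.toString('hex')); // f0909080

// Hypothetical helper: write one 16-bit unit (0x0800..0xffff) as three
// octets, the way the surrogate-by-surrogate variant (CESU-8) would.
function encodeUnit(unit) {
  return [
    0xe0 | (unit >> 12),
    0x80 | ((unit >> 6) & 0x3f),
    0x80 | (unit & 0x3f)
  ];
}

// Each surrogate is encoded on its own, giving six octets in total.
const cesu8 = Buffer.from(encodeUnit(0xd801).concat(encodeUnit(0xdc00)));
console.log(cesu8.toString('hex')); // eda081edb080
```

a strict utf-8 decoder is expected to reject the six-octet form, which is one way to end up with replacement characters.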

sigh.

EDIT i did some experiments and found out that when i do

var f = function( text ) {
  document.write( '<h1>',  text,                                '</h1>'  );
  document.write( '<div>', text.length,                         '</div>' );
  document.write( '<div>0x', text.charCodeAt(0).toString( 16 ), '</div>' );
  document.write( '<div>0x', text.charCodeAt(1).toString( 16 ), '</div>' );
  console.log( '<h1>',  text,                                 '</h1>'  );
  console.log( '<div>', text.length,                          '</div>' );
  console.log( '<div>0x', text.charCodeAt(0).toString( 16 ),  '</div>' );
  console.log( '<div>0x', text.charCodeAt(1).toString( 16 ),  '</div>' ); };

f( '𩄎' );
f( String.fromCharCode( 0xd864, 0xdd0e ) );

i do get correct results in google chrome---both inside the browser window and on the console:

𩄎
2
0xd864
0xdd0e
𩄎
2
0xd864
0xdd0e

however, this is what i get when using nodejs' console.log:

<h1> � </h1>
<div> 1 </div>
<div>0x fffd </div>
<div>0x NaN </div>
<h1> �����</h1>
<div> 2 </div>
<div>0x d864 </div>
<div>0x dd0e </div>

this seems to indicate that both parsing utf-8 with code points beyond 0xffff and outputting those characters to the console are broken. python 3.1, by the way, does treat the character as a surrogate pair and can print the character to the console.
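for reference, the two 16bit values reported by chrome map back to a single code point by plain arithmetic. a small sketch; codePointFromPair is a made-up name, and the codePointAt call assumes an engine newer than the ones discussed here (it was only added in ES2015):

```javascript
// Combine a high surrogate (0xd800..0xdbff) and a low surrogate
// (0xdc00..0xdfff) into the code point they stand for.
function codePointFromPair(hi, lo) {
  return (hi - 0xd800) * 0x400 + (lo - 0xdc00) + 0x10000;
}

console.log(codePointFromPair(0xd864, 0xdd0e).toString(16)); // 2910e

// On ES2015+ engines the built-in codePointAt does the same combination.
const viaBuiltin = String.fromCharCode(0xd864, 0xdd0e).codePointAt(0);
console.log(viaBuiltin.toString(16)); // 2910e
```

so a text.length of 2 together with 0xd864 and 0xdd0e is what a correctly parsed U+2910E looks like from inside javascript, while a length of 1 with 0xfffd means the character was already lost during parsing.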

NOTE i've cross-posted this question to the v8-users mailing list.

Answer

Ned Batchelder · Aug 8, 2011

This recent presentation covers all sorts of issues with Unicode in popular languages, and isn't kind to Javascript: The Good, the Bad, & the (mostly) Ugly

It covers the issue with the two-byte representation of Unicode in Javascript:

The UTF‐16 née UCS‐2 Curse

Like several other languages, Javascript suffers from The UTF‐16 Curse. Except that Javascript has an even worse form of it, The UCS‐2 Curse. Things like charCodeAt and fromCharCode only ever deal with 16‐bit quantities, not with real, 21‐bit Unicode code points. Therefore, if you want to print out something like 𝒜, U+1D49C, MATHEMATICAL SCRIPT CAPITAL A, you have to specify not one character but two “char units”: "\uD835\uDC9C". 😱

// ERROR!! 
document.write(String.fromCharCode(0x1D49C));
// needed bogosity
document.write(String.fromCharCode(0xD835,0xDC9C));
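The two "char units" in the quoted example do not have to be looked up by hand; they follow from the code point by simple arithmetic. The following is only a sketch: toSurrogatePair is a hypothetical helper name, not something from the presentation or from any library, and the final String.fromCodePoint comparison assumes an ES2015-capable engine, which did not exist when this was written.

```javascript
// Split a code point into the 16-bit units that fromCharCode expects.
// Code points below 0x10000 fit in one unit and are returned as-is.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000) {
    return [codePoint];
  }
  const offset = codePoint - 0x10000;
  return [0xd800 + (offset >> 10), 0xdc00 + (offset & 0x3ff)];
}

// U+1D49C MATHEMATICAL SCRIPT CAPITAL A
const units = toSurrogatePair(0x1d49c);
console.log(units.map(function (u) { return u.toString(16); }).join(' ')); // d835 dc9c

// Build the string from the computed units.
const scriptA = String.fromCharCode.apply(null, units);

// On ES2015+ engines, String.fromCodePoint accepts the code point directly.
console.log(scriptA === String.fromCodePoint(0x1d49c)); // true
```

Either way the resulting string still has a length of 2, since the string itself is stored as 16-bit units.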