😃 (and other Unicode characters) in identifiers not allowed by g++

Joseph Mansfield · Oct 2, 2012 · Viewed 9.7k times

I am 😞 to find that I cannot use 😃 as a valid identifier with g++ 4.7, even with the -fextended-identifiers option enabled:

int main(int argc, const char* argv[])
{
  const char* 😃 = "I'm very happy";
  return 0;
}

main.cpp:3:3: error: stray ‘\360’ in program
main.cpp:3:3: error: stray ‘\237’ in program
main.cpp:3:3: error: stray ‘\230’ in program
main.cpp:3:3: error: stray ‘\203’ in program
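
For reference, the four stray bytes in that output are the UTF-8 encoding of U+1F603, printed one byte at a time in octal. The sketch below reproduces them; `encode_utf8` and `octal_escapes` are hypothetical helper names introduced here for illustration, not part of g++ or any library, and the encoder assumes a valid scalar value (it does not reject surrogates).

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Assumes cp is a valid scalar value; surrogates are not rejected.
std::string encode_utf8(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: render each byte as a backslash plus three octal
// digits, the same notation the "stray" diagnostics use.
std::string octal_escapes(const std::string& bytes) {
    std::string out;
    char buf[8];
    for (unsigned char b : bytes) {
        std::snprintf(buf, sizeof buf, "\\%03o", static_cast<unsigned>(b));
        out += buf;
    }
    return out;
}
```

Calling `octal_escapes(encode_utf8(0x1F603))` yields the text `\360\237\230\203`, which matches the four errors byte for byte: the compiler is seeing the raw UTF-8 bytes rather than a single character.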

After some googling, I discovered that UTF-8 characters are not yet supported in identifiers but a universal-character-name should work. So I converted my source to:

int main(int argc, const char* argv[])
{
  const char* \U0001F603 = "I'm very happy";
  return 0;
}

main.cpp:3:15: error: universal character \U0001F603 is not valid in an identifier

So apparently 😃 isn't a valid identifier character. However, the standard specifically allows characters from the range 10000-1FFFD in Annex E.1 and doesn't disallow it as an initial character in E.2. My next effort was to see if any other allowed Unicode characters worked, but none that I tried did. Not even the ever-important PILE OF POO (💩) character.
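
To make the Annex E reasoning concrete, here is a rough sketch of the two checks. It is based on my reading of C++11 Annex E and is not how g++ implements it: the function names are made up for illustration, only the supplementary-plane ranges of E.1 (10000-1FFFD through E0000-EFFFD) are encoded, and the BMP ranges are left out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true if cp lies in one of the supplementary-plane
// ranges of Annex E.1 (10000-1FFFD, 20000-2FFFD, ..., E0000-EFFFD).
// The BMP ranges listed in E.1 are deliberately omitted from this sketch.
bool in_supplementary_e1_range(std::uint32_t cp) {
    if (cp < 0x10000 || cp > 0xEFFFD) return false;
    return (cp & 0xFFFF) <= 0xFFFD;  // the last two code points of each plane are excluded
}

// Hypothetical helper: true if cp is in one of the Annex E.2 ranges
// (combining characters) that may not begin an identifier.
bool disallowed_initially(std::uint32_t cp) {
    return (cp >= 0x0300 && cp <= 0x036F) ||
           (cp >= 0x1DC0 && cp <= 0x1DFF) ||
           (cp >= 0x20D0 && cp <= 0x20FF) ||
           (cp >= 0xFE20 && cp <= 0xFE2F);
}

// Both characters from the question pass the E.1 check and are not
// excluded as initial characters by E.2.
void check_examples() {
    assert(in_supplementary_e1_range(0x1F603));  // U+1F603, the smiley
    assert(!disallowed_initially(0x1F603));
    assert(in_supplementary_e1_range(0x1F4A9));  // U+1F4A9, PILE OF POO
    assert(!disallowed_initially(0x1F4A9));
}
```

Under these assumptions, both U+1F603 and U+1F4A9 satisfy the standard's rules for an initial identifier character, which is why the compiler's rejection is surprising.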

So, for the sake of meaningful and descriptive variable names, what gives? Does -fextended-identifiers do as it advertises or not? Is it only supported in the very latest build? And what kind of support do other compilers have?

Answer

kennytm · Oct 2, 2012

As of 4.8, gcc does not support characters outside the BMP in identifiers, which seems to be an unnecessary restriction. Also, gcc only supports a very restricted set of characters, described in ucnid.tab, based on C99 and C++98 (it has not yet been updated to C11 and C++11, it seems).

As described in the manual, -fextended-identifiers is experimental, so there is a higher chance that it won't work as expected.


Edit:

GCC has supported the C11 character set since 4.9.0 (svn r204886, to be precise), so the OP's second piece of code, using \U0001F603, does work. I still can't get the actual code using 😃 to work with GCC 8.2 on https://gcc.godbolt.org, though, even with -finput-charset=UTF-8. (You may want to follow this bug report, provided by @DanielWolf.)

Meanwhile both pieces of code work on clang 3.3 without any options other than -std=c++11.