I'm working with code that performs various IO operations on files, and I want it to handle international filenames. I'm working on a Mac with Java 1.5, and if a filename contains Unicode characters that require surrogates, the JVM can't seem to locate the file. For example, my test file is:
"草𦿶鷗外.gif"
which gets broken into the Java characters \u8349\uD85B\uDFF6\u9DD7\u5916.gif
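For reference, the pair \uD85B\uDFF6 is what the standard UTF-16 surrogate arithmetic gives for the supplementary character U+26FF6 (𦿶). A minimal Python sketch of that arithmetic; the function name `to_surrogates` is only illustrative:

```python
def to_surrogates(code_point):
    """Split a supplementary code point into its UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary (non-BMP) code point")
    offset = code_point - 0x10000        # 20-bit value to split in two
    high = 0xD800 + (offset >> 10)       # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go in the low surrogate
    return high, low


high, low = to_surrogates(0x26FF6)
print("\\u%04X\\u%04X" % (high, low))    # prints \uD85B\uDFF6
```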
If I create a File from this filename, I can't open it because I get a FileNotFoundException. Even listing the folder that contains the file fails:
File[] files = folder.listFiles();
for (File file : files) {
    if (!file.exists()) {
        System.out.println("Failed to find File"); // Fails on the surrogate filename
    }
}
Most of the code I am actually dealing with is of the form:
FileInputStream instream = new FileInputStream(new File("草𦿶鷗外.gif"));
// operations follow
Is there some way I can address this problem, either escaping the filenames or opening files differently?
I suspect one of Java or Mac is using CESU-8 instead of proper UTF-8. Java uses “modified UTF-8” (which is a slight variation of CESU-8) for a variety of internal purposes, but I wasn't aware it could use it as a filesystem/defaultCharset. Unfortunately I have neither Mac nor Java here to test with.
“Modified” is a modified way of saying “badly bugged”. Instead of outputting a four-byte UTF-8 sequence for supplementary (non-BMP) characters like 𦿶:
\xF0\xA6\xBF\xB6
it outputs a UTF-8-encoded sequence for each of the surrogates:
\xED\xA1\x9B\xED\xBF\xB6
This isn't a valid UTF-8 sequence, but a lot of decoders will allow it anyway. The problem is that if you round-trip it through a real UTF-8 encoder you get a different byte string: the four-byte one above. Try to access the file with that name and boom! fail.
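To see the mismatch concretely, here is a rough sketch in Python 3 (an assumption; the session further down uses Python 2.x). The helpers `cesu8_encode` and `cesu8_decode` are made-up names, and they implement plain CESU-8 only; Java's modified UTF-8 additionally encodes U+0000 as two bytes, which is ignored here.

```python
def cesu8_encode(text):
    """Encode each UTF-16 code unit as its own UTF-8 sequence (CESU-8 style)."""
    utf16 = text.encode("utf-16-be")
    units = [chr(int.from_bytes(utf16[i:i + 2], "big"))
             for i in range(0, len(utf16), 2)]
    # 'surrogatepass' lets the lone surrogates through the UTF-8 encoder.
    return "".join(units).encode("utf-8", "surrogatepass")


def cesu8_decode(data):
    """Decode CESU-8 bytes, recombining surrogate pairs into real characters."""
    units = data.decode("utf-8", "surrogatepass")
    return units.encode("utf-16-be", "surrogatepass").decode("utf-16-be")


name = "\u8349\U00026FF6\u9DD7\u5916.gif"   # the test filename from the question
proper = name.encode("utf-8")               # real UTF-8, four bytes for U+26FF6
cesu = cesu8_encode(name)                   # six bytes for the same character

print(proper)
print(cesu)
print(proper == cesu)                                  # False: different filenames
print(cesu8_decode(cesu).encode("utf-8") == proper)    # True: round trip changes the bytes
```

A lenient decoder turns both byte strings into the same text, but re-encoding that text with a strict UTF-8 encoder only ever produces the four-byte form, so a file stored under the six-byte name is no longer reachable.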
So first let's just check how filenames are actually stored under your current filesystem, using a platform that uses bytes for filenames such as Python 2.x:
$ python
Python 2.x.something (blah blah)
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.listdir('.')
On my filesystem (Linux, ext4, UTF-8), the filename “草𦿶鷗外.gif” comes out as:
['\xe8\x8d\x89\xf0\xa6\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif']
which is what you want. If that's what you get, it's probably Java doing it wrong. If you get the longer six-byte-character version:
['\xe8\x8d\x89\xed\xa1\x9b\xed\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif']
it's probably OS X doing it wrong... does it always store filenames like this? (Or did the files come from somewhere else originally?) What if you rename the file to the ‘proper’ version?:
os.rename('\xe8\x8d\x89\xed\xa1\x9b\xed\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif', '\xe8\x8d\x89\xf0\xa6\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif')
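If there turn out to be many such files, the same check and rename could be scripted. This is only a sketch, assuming Python 3 (where passing a bytes path to `os.listdir` returns bytes names) and assuming the filesystem will accept the renamed form; `proper_utf8_name` and `fix_directory` are hypothetical helper names, and the script defaults to a dry run that only prints what it would do.

```python
import os


def proper_utf8_name(raw):
    """Return the strict UTF-8 form of a CESU-8 style byte filename.

    Returns None if the name is already valid UTF-8 or cannot be repaired.
    """
    try:
        raw.decode("utf-8")
        return None                      # already strict UTF-8, nothing to fix
    except UnicodeDecodeError:
        pass
    try:
        # Accept the encoded surrogates, then recombine them via UTF-16.
        units = raw.decode("utf-8", "surrogatepass")
        text = units.encode("utf-16-be", "surrogatepass").decode("utf-16-be")
        return text.encode("utf-8")
    except UnicodeError:
        return None                      # some other kind of broken name


def fix_directory(path=b".", dry_run=True):
    """Report (and optionally rename) CESU-8 style filenames under path."""
    for raw in os.listdir(path):
        fixed = proper_utf8_name(raw)
        if fixed is None:
            continue
        print("%r -> %r" % (raw, fixed))
        if not dry_run:
            os.rename(os.path.join(path, raw), os.path.join(path, fixed))


if __name__ == "__main__":
    fix_directory(b".", dry_run=True)    # switch to dry_run=False to really rename
```

Running it with `dry_run=True` first shows which names, if any, are stored in the six-byte form before anything is touched.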