setting a UTF-8 in java and csv file

mehdi picture mehdi · Nov 16, 2010 · Viewed 92.5k times · Source

I am using this code for add Persian words to a csv file via OpenCSV:

String[] entries="\u0645 \u062E\u062F\u0627".split("#");
try{
    CSVWriter writer=new CSVWriter(new OutputStreamWriter(new FileOutputStream("C:\\test.csv"), "UTF-8"));

    writer.writeNext(entries);
    writer.close();
}
catch(IOException ioe){
    ioe.printStackTrace();
}

When I open the resulting csv file, in Excel, it contains "ứỶờịỆ". Other programs such as notepad.exe don't have this problem, but all of my users are using MS Excel.

Replacing OpenCSV with SuperCSV does not solve this problem.

When I typed Persian characters into csv file manually, I don't have any problems.

Answer

AlexR picture AlexR · Nov 16, 2010

I spent some time but found solution for your problem.

First I opened notepad and wrote the following line: שלום, hello, привет Then I saved it as file he-en-ru.csv using UTF-8. Then I opened it with MS excel and everything worked well.

Now, I wrote a simple java program that prints this line to file as following:

    PrintWriter w = new PrintWriter(new OutputStreamWriter(os, "UTF-8"));
    w.print(line);
    w.flush();
    w.close();

When I opened this file using excel I saw "gibrish."

Then I tried to read content of 2 files and (as expected) saw that file generated by notepad contains 3 bytes prefix:

    239 EF
    187 BB
    191 BF

So, I modified my code to print this prefix first and the text after that:

    String line = "שלום, hello, привет";
    OutputStream os = new FileOutputStream("c:/temp/j.csv");
    os.write(239);
    os.write(187);
    os.write(191);

    PrintWriter w = new PrintWriter(new OutputStreamWriter(os, "UTF-8"));

    w.print(line);
    w.flush();
    w.close();

And it worked! I opened the file using excel and saw text as I expected.

Bottom line: write these 3 bytes before writing the content. This prefix indicates that the content is in 'UTF-8 with BOM' (otherwise it is just 'UTF-8 without BOM').