How to export umlaut (or any foreign character) in Matlab eps format?

user1588695 · Aug 9, 2012 · Viewed 8.2k times

I'm trying to use an umlaut within a legend command in MATLAB. A quick Google tells me the form I want is char(146), and that works fine for displaying the file, or printing it to tif.

But when I print to EPS format (or epsc, eps2, epsc2), a different character is displayed in the file. I've tried printing the first 300-odd characters, and they certainly change (albeit very slowly, a good half of which are "A" with a symbol immediately afterward), but this seems a pretty slow approach, and I'm not guaranteed to actually find the symbol I want. So, does anyone here have any ideas on what I can try?

I'm using MATLAB R2011a, my default character set is UTF-8, and my legend line looks something like:

legend( plot_id , strcat('lala',char(146)) )

and my print line looks like..

print -depsc2 -tiff -r600 <filename>

(but switching off the tiff thumbnail generation doesn't have any effect)

Answer

guimillet · Feb 8, 2013

The problem appears when the MATLAB character encoding is UTF-8, which is usually the case for Linux users (hence no problem for Amro's configuration using CP1252). When the MATLAB character set encoding (get it with slCharacterEncoding()) is UTF-8, MATLAB's EPS export function is buggy (at least until R2011b): it writes non-ASCII characters as octal-escaped UTF-8 (2 bytes), whereas the PostScript interpreter is set up to decode 1-byte codes.

Let's illustrate the bug with the character ö (U+00F6), some of whose representations are:

  • UTF-16: 0x00F6
  • UTF-8: 0xC3 0xB6
  • C octal escaped UTF-8: \303\266
  • XML decimal entity: &#246;

The eps file created by MATLAB contains:

/Helvetica /ISOLatin1Encoding 120 FMSR
(\303\266) s

MATLAB defines in the EPS file a function FMSR that re-encodes the Helvetica font into another encoding, here ISOLatin1Encoding, which is one of the two built-in encoding vectors and closely matches the ISO-8859-1 (Latin-1) standard (see p. 329-330 of the PostScript Language Reference Manual for more details). Briefly, an encoding vector is a 256-element array that associates a character name with each character code, so the interpreter only reads 1-byte character codes. In ISO-8859-1, \303=195=Ã and \266=182=¶. As a result, it prints Ã¶ instead of ö.
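
The mismatch can be reproduced outside MATLAB. The following is only a small Python 3 illustration (the variable names are arbitrary): it encodes ö as UTF-8, formats the bytes as octal escapes the way they appear in the EPS file, and then reads the same bytes back one at a time as ISO-8859-1.

```python
# Illustration only: UTF-8 bytes read back one byte at a time as Latin-1.
ch = "\u00f6"  # the character ö

# The two UTF-8 bytes of ö: 0xC3 0xB6
utf8_bytes = ch.encode("utf-8")

# The same bytes as three-digit octal escapes, as seen in the EPS string
octal_escaped = "".join("\\%03o" % b for b in utf8_bytes)

# What a 1-byte (ISO-8859-1) decoder makes of those two bytes
misread = utf8_bytes.decode("latin-1")

print(octal_escaped)  # \303\266
print(misread)        # Ã¶
```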

Options for exporting non-ASCII ISO-8859-1 characters with a UTF-8 locale environment

  1. Convert the octal UTF-8 codes into octal ISO-8859-1 codes, which is easy because non-ASCII ISO-8859-1 characters follow the same layout in UTF-8. For example, with the program sed, which can be run from the Command window or from your export script:

    !sed -i -e 's/\\302\(\\2[4-7][0-7]\)/\1/g' -e 's/\\303\\2\([0-7][0-7]\)/\\3\1/g' file.eps
    

    Thus, \303\266 becomes \366=246=ö. You can directly type the non-ASCII characters in MATLAB.

  2. Change the MATLAB character set encoding with slCharacterEncoding('ISO-8859-1') before adding text to the figure and, if you add text from the Command window, use char(number) for non-ASCII characters. If you add text directly in the figure with the plot tools, you can enter the non-ASCII characters. This solution is not ideal because the non-ASCII characters do not appear on the figure in the default font (Helvetica by default with MATLAB on Linux) and it requires using char(number) if you script the creation of the figure.

  3. Render the text later with LaTeX by using a user-submitted MATLAB function such as LaPrint or one of its forks, which creates a tex file with the text of the figure and an eps file with the non-text part of the figure. A similar solution is matlab2tikz, which creates a TikZ/pgfplots file and a tex file.

  4. Use the LaTeX interpreter of MATLAB: \"{o}. MATLAB creates the character by combining the ASCII character with its diacritic, but the result is low quality because of bad relative positioning (the diacritic sits slightly too far to the right of the character). MATLAB uses the glyphs from the Computer Modern font and embeds the font in the eps file (which adds ~80 kB). Furthermore, the raw text in the pdf created from the eps does not contain ö but o ̈.
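
For systems without sed, the substitution from option 1 can be sketched in Python 3. This is only a rough equivalent of the two sed expressions, not a tested replacement for them; the function name utf8_octal_to_latin1_octal is made up for the example, and it works on the file contents as a string.

```python
import re

def utf8_octal_to_latin1_octal(text):
    """Rewrite octal-escaped 2-byte UTF-8 sequences as Latin-1 octal escapes."""
    # \302\240 .. \302\277 (U+00A0 .. U+00BF): drop the \302 lead byte.
    text = re.sub(r"\\302(\\2[4-7][0-7])", r"\1", text)
    # \303\200 .. \303\277 (U+00C0 .. U+00FF): drop \303 and turn \2xx into \3xx.
    text = re.sub(r"\\303\\2([0-7][0-7])", r"\\3\1", text)
    return text

print(utf8_octal_to_latin1_octal(r"(\303\266) s"))  # (\366) s
```

To apply it to a file, one could read file.eps with encoding 'latin-1' (which maps every byte to one character), pass the contents through the function, and write the result back with the same encoding.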

Exporting non-ISO-8859-1 characters

For exporting characters that are not in ISO-8859-1, which was asked here, there is probably a reasonable solution if the number of characters needed is less than 256 (8-bit format) and they ideally belong to a standard encoding set. It involves the following steps:

  1. Convert the octal code into the Unicode character;
  2. Save the file into the target encoding standard (in an 8-bit format);
  3. Add the encoding vector for the target encoding set.

For example, if you want to export Polish text, you need to convert the file into ISO-8859-2. Here is an implementation on Linux with Bash:

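#!/bin/bash
# Usage: eps_lat2 file.eps  ->  writes file_latin2.eps in the current directory
name=$(basename "$1" .eps)
# 1. Convert octal-escaped UTF-8 (e.g. \303\266) into the actual characters
ascii2uni -a K "$1" > /tmp/eps_uni.eps
# 2. Re-encode the file from UTF-8 to ISO-8859-2
iconv -f UTF-8 -t ISO-8859-2 /tmp/eps_uni.eps -o "$name"_latin2.eps
# 3. Insert the Latin-2 encoding vector after the page setup and reference it
sed -i -e '/%EndPageSetup/ r ISOLatin2Encoding.ps' -e 's/ISOLatin1Encoding/MyEncoding/' "$name"_latin2.eps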

saved as eps_lat2; then running the command sh eps_lat2 file.eps creates file_latin2.eps with Latin-2 encoding. The file ISOLatin2Encoding.ps contains this:

/MyEncoding
% The first 144 entries are the same as the ISO Latin-1 encoding.
ISOLatin1Encoding 0 144 getinterval aload pop
% \22x
    /.notdef /.notdef /.notdef /.notdef /.notdef /.notdef /.notdef /.notdef
    /.notdef /.notdef /.notdef /.notdef /.notdef /.notdef /.notdef /.notdef
% \24x
    /nbspace /Aogonek /breve /Lslash /currency /Lcaron /Sacute /section
    /dieresis /Scaron /Scedilla /Tcaron /Zacute /hyphen /Zcaron /Zdotaccent
    /degree /aogonek /ogonek /lslash /acute /lcaron /sacute /caron
    /cedilla /scaron /scedilla /tcaron /zacute /hungarumlaut /zcaron /zdotaccent
% \30x
    /Racute /Aacute /Acircumflex /Abreve /Adieresis /Lacute /Cacute /Ccedilla
    /Ccaron /Eacute /Eogonek /Edieresis /Ecaron /Iacute /Icircumflex /Dcaron
    /Dcroat /Nacute /Ncaron /Oacute /Ocircumflex /Ohungarumlaut /Odieresis /multiply
    /Rcaron /Uring /Uacute /Uhungarumlaut /Udieresis /Yacute /Tcedilla /germandbls
% \34x
    /racute /aacute /acircumflex /abreve /adieresis /lacute /cacute /ccedilla
    /ccaron /eacute /eogonek /edieresis /ecaron /iacute /icircumflex /dcaron
    /dcroat /nacute /ncaron /oacute /ocircumflex /ohungarumlaut /odieresis /divide
    /rcaron /uring /uacute /uhungarumlaut /udieresis /yacute /tcedilla /dotaccent
256 packedarray def

Here is another implementation in Python 2 (so it also works on Windows and Mac):

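#!/usr/bin/python
# -*- coding: utf-8 -*-
# Python 2 only: relies on the 'string_escape' codec and str.decode().
import sys,codecs
input = sys.argv[1]
# The output file is written in Latin-2 (ISO-8859-2).
fo = codecs.open(input[:-4]+'_latin2.eps','w','latin2')
# 'string_escape' turns the octal escapes (e.g. \303\266) into raw bytes.
with codecs.open(input,'r','string_escape') as fi:
    data = fi.readlines()
with open('ISOLatin2Encoding.ps') as fenc:
    for line in data:
        # Decode the raw bytes as UTF-8 and switch to the custom encoding vector.
        fo.write(line.decode('utf-8').replace('ISOLatin1Encoding','MyEncoding'))
        # Insert the encoding vector definition right after the page setup.
        if line.startswith('%%EndPageSetup'):
            fo.write(fenc.read())
fo.close()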

saved as eps_lat2.py; then running the command python eps_lat2.py file.eps creates file_latin2.eps with Latin-2 encoding.

It can easily be adapted to other 8-bit encoding standards by changing the encoding vector and the iconv (or codecs.open) parameter in the script.
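
For readers without Python 2, here is a rough Python 3 sketch of the same three steps with the target encoding as a parameter. It is not the script above: the function name convert_eps, its parameters and the HIGH_OCTAL pattern are made up for this illustration. It assumes the EPS file has no binary TIFF preview and that every \2xx or \3xx escape belongs to a UTF-8 sequence. Unlike string_escape, it leaves other escapes such as \( untouched.

```python
import re

# A three-digit octal escape for a byte >= 0x80 (\200 .. \377).
HIGH_OCTAL = re.compile(rb"\\([23][0-7]{2})")

def convert_eps(eps_bytes, vector_ps, codec="iso8859_2", vector_name="MyEncoding"):
    """Return the EPS content re-encoded for an 8-bit codec (sketch only)."""
    # Step 1: turn octal-escaped UTF-8 bytes into raw bytes, then into Unicode.
    raw = HIGH_OCTAL.sub(lambda m: bytes([int(m.group(1), 8)]), eps_bytes)
    text = raw.decode("utf-8")
    # Step 3a: make the font re-encoding use the custom vector.
    # Done before the vector is inserted, so the vector text itself is not altered.
    text = text.replace("ISOLatin1Encoding", vector_name)
    if not vector_ps.endswith("\n"):
        vector_ps += "\n"
    out = []
    for line in text.splitlines(keepends=True):
        out.append(line)
        # Step 3b: define the vector right after the page setup section.
        if line.startswith("%%EndPageSetup"):
            out.append(vector_ps)
    # Step 2: save in the target 8-bit encoding.
    return "".join(out).encode(codec)
```

To use it on a file, read file.eps in binary mode, read ISOLatin2Encoding.ps as text, and write the returned bytes to file_latin2.eps in binary mode. For another 8-bit standard, pass a different codec name together with the matching encoding vector.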