Displaying UTF-8 characters in PDF

Themasterhimself · Apr 16, 2013

I am trying to display a PDF that the backend sends as a binary string. This is the AJAX call I am making:

    $.ajax({
        type : 'GET',
        url : '<url>',          
        data : oParameters,
        contentType : 'application/pdf;charset=UTF-8',
        success : function(odata) {
            window.open("data:application/pdf;charset=utf-8," + escape(odata));
        }
    });

When I try to open the PDF in a new window, the URL looks like:

data:application/pdf;charset=utf-8,%25PDF-1.3%0D%0A%25%uFFFD%uFFFD%uFFFD%uFFFD%0D%0A2%200%20obj%0D%0A/WinAnsiEncoding%0D........

As you can see, it uses "WinAnsiEncoding" to display the PDF. Because of this, some of the characters are not being displayed properly. How do I change this to UTF-8?

EDIT: The backend is in ABAP. I am converting a smartform to OTF and then to a string using the function module "CONVERT_OTF".

    CALL FUNCTION fname
      EXPORTING
        user_settings      = space
        control_parameters = ls_ctropt
        output_options     = ls_output
        gv_lang            = lv_lang
      IMPORTING
        job_output_info    = ls_body_text
      EXCEPTIONS
        formatting_error   = 1
        internal_error     = 2
        send_error         = 3
        user_canceled      = 4
        OTHERS             = 5.

    CALL FUNCTION 'CONVERT_OTF'
      EXPORTING
        format                = 'PDF'
      IMPORTING
        bin_filesize          = ls_pdf_len
        bin_file              = ls_pdf_xstring
      TABLES
        otf                   = ls_body_text-otfdata
        lines                 = lt_lines
      EXCEPTIONS
        err_max_linewidth     = 1
        err_format            = 2
        err_conv_not_possible = 3
        err_bad_otf           = 4
        OTHERS                = 5.

    CALL METHOD server->response->set_header_field( name = 'Content-Type'
      value = 'application/pdf;charset=UTF-8' ).
    CALL METHOD server->response->append_data( data = lv_pdf_string
      length = lv_len ).

Answer

mkl · Apr 17, 2013

Concerning your remark that it uses "WinAnsiEncoding" to display the PDF:

After the comma in

data:application/pdf;charset=utf-8,%25PDF-1.3%0D%0A%25%uFFFD%uFFFD%uFFFD%uFFFD%0D%0A2%200%20obj%0D%0A/WinAnsiEncoding%0D........

everything is pure data. Thus, "WinAnsiEncoding" is merely part of the content of the PDF, and if it is the reason for your troubles, the PDF generator must be asked to change its PDF generation process.

In the case at hand, your data is:

%PDF-1.3
%...
2 0 obj
/WinAnsiEncoding
........

which is a completely normal PDF structure. It merely means that PDF object 2 is defined as /WinAnsiEncoding, which may or may not be used for some font definition; even if it is used, it may still be adapted by some /Differences to include the characters you require. Furthermore, it does not make sense to change this to UTF-8 (as you request), because UTF-8 is not a standard encoding for PDF page content. If you somehow put UTF-8 there, you'll break the PDF even more.

I'm afraid, though, that there are other problems, too.

  1. You add a charset parameter to the type application/pdf. This does not make sense: PDF is a binary format, i.e. a sequence of bytes is expected and, therefore, no charset is involved.

  2. Your method call escape(odata) creates %uFFFD%uFFFD%uFFFD%uFFFD. This is invalid according to the RFCs, which only define the following percent-encoding:

    A percent-encoding mechanism is used to represent a data octet in a component when that octet's corresponding character is outside the allowed set or is being used as a delimiter of, or within, the component. A percent-encoded octet is encoded as a character triplet, consisting of the percent character "%" followed by the two hexadecimal digits representing that octet's numeric value.

    (RFC 3986, section 2.1)

    Because the percent ("%") character serves as the indicator for percent-encoded octets, it must be percent-encoded as "%25" for that octet to be used as data within a URI.

    (ibidem, section 2.4)

    Thus, %uFFFD%uFFFD%uFFFD%uFFFD is invalid.

  3. PDF, being a binary format, is better suited to Base64 encoding, i.e.

    data:application/pdf;base64,BASE_64_ENCODED_PDF
    

    Thus, I propose you change your client-side process accordingly.
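
Points 2 and 3 can be illustrated with a minimal sketch. It is not a drop-in fix: the function names `bytesToBase64`, `pdfDataUri` and `openPdf` and the `url` parameter are hypothetical, and it assumes the backend sends the PDF as raw bytes and that the browser supports `XMLHttpRequest` with `responseType = 'arraybuffer'`. Note also that %uFFFD is U+FFFD, the Unicode replacement character, which suggests the binary data had already been decoded as text (and damaged) before escape() was applied, so the bytes should be requested as binary in the first place.

```javascript
// Point 2: escape() emits the non-standard %uXXXX form for code units above 0xFF,
// whereas encodeURIComponent() emits percent-encoded UTF-8 octets (RFC 3986 style).
var legacy = escape('\uFFFD');               // "%uFFFD"
var standard = encodeURIComponent('\uFFFD'); // "%EF%BF%BD"

// Point 3: Base64-encode raw bytes (an Array or Uint8Array of values 0..255).
function bytesToBase64(bytes) {
    var binary = '';
    for (var i = 0; i < bytes.length; i++) {
        // One character per byte, so btoa() sees a pure "binary string".
        binary += String.fromCharCode(bytes[i]);
    }
    // btoa is available in browsers; fall back to Buffer where it is missing.
    return typeof btoa === 'function'
        ? btoa(binary)
        : Buffer.from(binary, 'binary').toString('base64');
}

// Build a data URI without any charset parameter, since PDF is binary.
function pdfDataUri(bytes) {
    return 'data:application/pdf;base64,' + bytesToBase64(bytes);
}

// Hypothetical browser-side usage: fetch the PDF as bytes, never as text.
function openPdf(url) {
    var xhr = new XMLHttpRequest();
    xhr.open('GET', url);
    xhr.responseType = 'arraybuffer'; // keep the response as raw bytes
    xhr.onload = function () {
        if (xhr.status === 200) {
            window.open(pdfDataUri(new Uint8Array(xhr.response)));
        }
    };
    xhr.send();
}
```

For example, `pdfDataUri([0x25, 0x50, 0x44, 0x46])` (the bytes of "%PDF") yields `data:application/pdf;base64,JVBERg==`. If the characters are still wrong after this change, the cause lies in the PDF content itself, i.e. in the fonts and encodings chosen by the generator, not in the transport.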