What codepage/charset should be used to interpret data coming from an MVS system into a Java Environment?

user100645 · May 4, 2009 · Viewed 7.5k times

I've run into an interesting problem (as is often the case when interacting with legacy systems). I'm working on an application (which currently runs on an x86 Linux or Windows system) that can receive requests from a variety of systems, one of them being an MVS system.

I am attempting to determine which codepage/charset I should be using to interpret request data coming from the MVS system.

In the past, I've used 'cp500' (IBM-500) to interpret byte data coming from z/OS systems. However, I fear that since MVS is a bit of a legacy system, and since IBM seemed to change its mind constantly about which encoding to use (there must be dozens of EBCDIC encodings), cp500 may not be the correct encoding.

The best resource I've found on character sets in Java is http://mindprod.com/jgloss/encoding. However, neither that site nor the IBM InfoCenters have given me a clear answer.

EDIT: Added from my response to Pax below:

There was a glaring hole in my question: the origin of the request data. In this case, the data arrives through a WebSphere MQ interface. WebSphere MQ does have facilities for translating to the proper encoding, but only when the data is read with MQMessage.readString(), which has since been deprecated. I would prefer to use that, but I am using a proprietary interface framework in which I can't change how the message is read off the MQQueue; it reads bytes directly off the queue, so I am left to handle the translation myself.
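
Since the framework hands over raw bytes, the decoding ends up in application code. A minimal sketch of that step, assuming the payload arrives as a plain byte[] and that Cp500 is what the sender uses (both assumptions until confirmed):

    import java.nio.ByteBuffer;
    import java.nio.charset.Charset;

    public class MqPayloadDecoder {
        // Hypothetical helper: 'payload' is the raw byte array the
        // framework read off the MQQueue. "Cp500" (IBM-500) is an
        // assumption; substitute whatever encoding the sender uses.
        public static String decode(byte[] payload) {
            return Charset.forName("Cp500").decode(ByteBuffer.wrap(payload)).toString();
        }
    }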

Final Answer: I wanted to follow up on this. It turns out the correct character set was indeed cp500 (IBM-500). However, I'm under the impression that results may vary. Some tips for anyone else with the same issue:

Use Charset.availableCharsets(). This will give you a map of the character sets supported by your runtime. I iterated through these sets and printed my byte data decoded in each character set. While it didn't give me the answer I wanted (mainly because I wasn't able to read the data as it was coming in), I imagine it could be helpful for others; see the sketch after these tips.

Refer to http://mindprod.com/jgloss/encoding for a list of supported character sets.

Lastly, though I have not confirmed this, ensure you are using the right JRE. I believe the IBM runtimes support more EBCDIC character sets than OpenJDK or Sun's runtimes.
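
As a sketch of the first tip above, here is roughly what that iteration looks like; the sample bytes and method name are illustrative:

    import java.nio.ByteBuffer;
    import java.nio.charset.Charset;
    import java.util.Map;

    public class CharsetProbe {
        // Decode the same sample bytes with every charset this JRE
        // supports and print the result; eyeball the output for the
        // charset that yields readable text.
        public static void probe(byte[] sample) {
            Map<String, Charset> charsets = Charset.availableCharsets();
            for (Map.Entry<String, Charset> entry : charsets.entrySet()) {
                String decoded = entry.getValue().decode(ByteBuffer.wrap(sample)).toString();
                System.out.println(entry.getKey() + " -> " + decoded);
            }
        }
    }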

Answer

paxdiablo · May 4, 2009

"MVS is a bit of a legacy system"? Ha! It's still the OS of choice for applications where reliability is the number one concern. Now on to your question :-)

It depends entirely on what is generating the data. For example, if you're just downloading files from the host, the FTP negotiation may handle it. But since you mention Java, it's probably connecting via JDBC to DB2/z, and the JDBC drivers will handle it quite well (much better if you're using IBM's own JRE rather than the Sun version).

EBCDIC itself on the host has quite a few different encodings, so you need to at least let us know where the data is coming from. Recent versions of DB2 have no issue storing Unicode in the database, which would alleviate all your concerns.

First task: find out where the data is coming from, then get the encoding from your SysProg (if it's not handled automatically).

Update:

Andrew, based on your added text where you state you can't use the provided translations, you're going to have to use the manual method. You need to identify the source of the data and get the CCSID from it, then do the translation to and from Unicode (or whatever code page you're using, if not Unicode) manually.
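
A sketch of that manual step once the CCSID is known; the class and its names are illustrative, and assume the CCSID maps to one of the Java charset names like "Cp500" or "Cp037":

    import java.nio.ByteBuffer;
    import java.nio.charset.Charset;

    public class EbcdicTranslator {
        private final Charset hostCharset;

        // e.g. new EbcdicTranslator("Cp500") for CCSID 500,
        // or new EbcdicTranslator("Cp037") for CCSID 37.
        public EbcdicTranslator(String javaCharsetName) {
            this.hostCharset = Charset.forName(javaCharsetName);
        }

        // Host EBCDIC bytes -> Java String (Unicode internally)
        public String fromHost(byte[] hostBytes) {
            return hostCharset.decode(ByteBuffer.wrap(hostBytes)).toString();
        }

        // Java String -> host EBCDIC bytes, for the return trip
        public byte[] toHost(String text) {
            ByteBuffer buf = hostCharset.encode(text);
            byte[] out = new byte[buf.remaining()];
            buf.get(out);
            return out;
        }
    }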

CCSID 500 is the default code page for EBCDIC International (no Euro), but these machines are used all over the planet. z/OS conversion services is how you usually do the conversion on the mainframe.

Although this is an iSeries page, it lists a huge number of CCSIDs and their glyphs, applicable to the mainframe as well.

You probably just need to figure out whether you're using CCSID 500 or 37 (or one of the foreign-language versions) and work out the mapping to Unicode CCSID 1208. Your SysProg will be able to tell you which one. If you're working for a US company, it's probably 500 or 37, but IBM expends a great deal of effort supporting multiple code pages. I'll be glad when all their mainframe software stores and uses Unicode by default; it'll make things much easier.
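
If the choice really is just between 500 and 37, a quick way to tell them apart is to decode the same host bytes under both; the two pages differ in only a handful of code points (brackets and '!' among them), so any text containing those characters will give it away. A rough sketch, with the method name being illustrative:

    import java.nio.ByteBuffer;
    import java.nio.charset.Charset;

    public class Cp500Vs037 {
        // Decode the same host bytes under both common EBCDIC pages
        // and print both; where the output differs (e.g. '[' vs a cent
        // sign at byte 0x4A), the readable version identifies the CCSID.
        public static void compare(byte[] hostBytes) {
            System.out.println("Cp500: "
                    + Charset.forName("Cp500").decode(ByteBuffer.wrap(hostBytes)));
            System.out.println("Cp037: "
                    + Charset.forName("Cp037").decode(ByteBuffer.wrap(hostBytes)));
        }
    }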