How can we eliminate junk values in a field?

lamwaiman1988 · Jul 28, 2011

I have some CSV records that vary in length, for example:

0005464560,45667759,ZAMTR,!To ACC 12345678,DR,79.85

0006786565,34567899,ZAMTR,!To ACC 26575443,DR,1000

I need to separate each of these fields, and I need the last field, which should be a money amount.

However, when I read the file and UNSTRING the record into fields, I find that the last field contains junk values at its end. The amount (money) field should be 8 characters: 5 digits, 1 decimal point, then 2 digits. The input values can be anything, such as 13.5, 1000 and 354.23.

    "FILE SECTION"

        FD INPUT_FILE.
            01 INPUT_REC                                   PIC X(66).

    "WORKING-STORAGE SECTION"

            01 WS_INPUT_REC                                 PIC X(66).

            01 WS_AMOUNT_NUM                                PIC 9(5).9(2).
            01 WS_AMOUNT_TXT                                PIC X(8).

"MAIN SECTION"

                        UNSTRING INPUT_REC DELIMITED BY ","
                        INTO WS_ID_1, WS_ID_2, WS_CODE, WS_DESCRIPTION, WS_FLAG, WS_AMOUNT_TXT

                        MOVE WS_AMOUNT_TXT(1:8) TO WS_AMOUNT_NUM(1:8)

                        DISPLAY WS_AMOUNT_NUM

In the display, the values look normal (345.23, 1000), just as they are in the input. However, after I write the field to a file, here is what they become:

79.85^M^@^@ 137.35^M^@

I have inspected the field WS_AMOUNT_NUM, which is populated from WS_AMOUNT_TXT, and found that ^@ is a LOW-VALUE. However, I cannot work out what ^M is; it is neither a space nor a HIGH-VALUE.

Answer

NealB · Jul 28, 2011

I am guessing, but it looks like you may be reading variable-length records from a file into a fixed-length COBOL record. The junk at the end of the COBOL record is giving you some grief. It is hard to say how consistent that junk is going to be from one read to the next (data beyond the bounds of the actual input record length are technically undefined). That junk ends up being included in WS_AMOUNT_TXT after the UNSTRING.

There are a number of ways to solve this problem. The suggestion I am giving you here may not be optimal, but it is simple and should get the job done.

The last INTO field, WS_AMOUNT_TXT, in your UNSTRING statement is the one that receives all of the trailing junk. That junk needs to be stripped off. Knowing that the only valid characters in the last field are digits and the decimal character, you could clean it up as follows:

PERFORM VARYING WS_I FROM LENGTH OF WS_AMOUNT_TXT BY -1
          UNTIL WS_I = ZERO
    IF WS_AMOUNT_TXT(WS_I:1) IS NUMERIC OR
       WS_AMOUNT_TXT(WS_I:1) = '.'
       MOVE 1 TO WS_I
    ELSE
       MOVE SPACE TO WS_AMOUNT_TXT(WS_I:1)
    END-IF
END-PERFORM

The basic idea in the above code is to scan the last UNSTRING output field from the end toward the beginning, replacing anything that is not a digit or decimal point with a space. Once a valid digit or decimal point is found, the loop exits on the assumption that the rest will be valid. Setting WS_I to 1 is what ends the loop: PERFORM VARYING applies the BY -1 step after each pass and only then re-tests UNTIL WS_I = ZERO, so moving ZERO to WS_I would step it past zero instead of stopping.

After cleanup use the intrinsic function NUMVAL as outlined in my answer to your previous question to convert WS_AMOUNT_TXT into a numeric data type.
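The conversion step might look like the following minimal sketch. It assumes a compiler with intrinsic-function support. WS_AMOUNT (an unedited numeric work field) and the program name are hypothetical names introduced here, and the sample value stands in for a field that has already been cleaned up.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. NUMVAL-DEMO.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  WS_AMOUNT_TXT      PIC X(8).
      * WS_AMOUNT is a hypothetical work field, not in the question.
       01  WS_AMOUNT          PIC 9(5)V99.
       01  WS_AMOUNT_NUM      PIC 9(5).9(2).
       PROCEDURE DIVISION.
       MAIN-PARA.
      * Sample value as it looks after cleanup: digits, then spaces.
           MOVE "79.85" TO WS_AMOUNT_TXT
      * NUMVAL accepts leading and trailing spaces. ROUNDED guards
      * against truncation where NUMVAL returns a floating-point value.
           COMPUTE WS_AMOUNT ROUNDED = FUNCTION NUMVAL(WS_AMOUNT_TXT)
      * A numeric MOVE to the edited field aligns on the decimal point.
           MOVE WS_AMOUNT TO WS_AMOUNT_NUM
           DISPLAY WS_AMOUNT_NUM
           STOP RUN.
```

With the sample value this should display 00079.85. Going through a true numeric field lets the compiler align the value on the decimal point, whereas a MOVE between reference-modified items, as in the question, copies the bytes as they are, junk included.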

One final piece of advice: MOVE SPACES TO INPUT_REC before each READ to blow away data left over from a previous read that might still be in the buffer. This will protect you when reading a very "short" record after a "long" one; otherwise you may trip over data left over from the previous read.

Hope this helps.

EDIT: I just noticed this answer to your question about reading variable-length files. Using a variable-length input record is a better approach. Given the actual input record length, you can do something like:

UNSTRING INPUT_REC(1:REC_LEN) INTO...

Where REC_LEN is the variable specified after OCCURS DEPENDING ON for the INPUT_REC file FD. All the junk you are encountering occurs after the end of the record as defined by REC_LEN. Using reference modification as illustrated above trims it off before UNSTRING does its work to separate out the individual data fields.

EDIT 2: Cannot use reference modification with UNSTRING. Darn... It is possible with some other COBOL dialects but not with OpenVMS COBOL. Try the following:

MOVE INPUT_REC(1:REC_LEN) TO WS_BUFFER
UNSTRING WS_BUFFER INTO...

Where WS_BUFFER is a working storage PIC X variable long enough to hold the longest input record. When you MOVE a short alphanumeric field to a longer one (i.e. WS_BUFFER), the data is left-justified in the destination and the remaining space is padded with spaces. Since leading and trailing spaces are acceptable to the NUMVAL function, you have exactly what you need.

I have a reason for pushing you in this direction. Any junk that ends up at the trailing end of a record buffer when reading a short record is undefined. There is a possibility that some of that junk just might end up being a digit or a decimal point. Should this occur, the cleanup routine I originally suggested would fail.
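To see the effect without an input file, here is a small sketch that simulates a short read. Several things in it are assumptions made for illustration: INPUT_REC is declared in working storage instead of the FD, REC_LEN is set by hand instead of by the READ, and the sizes of the receiving fields other than WS_AMOUNT_TXT are guesses, since the question does not show them.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. BUFFER-DEMO.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * INPUT_REC lives in the FD in the real program; it is declared
      * here only so the sketch runs without an input file.
       01  INPUT_REC          PIC X(66).
       01  REC_LEN            PIC 9(4).
       01  WS_BUFFER          PIC X(66).
      * Field sizes below are assumed, except WS_AMOUNT_TXT.
       01  WS_ID_1            PIC X(10).
       01  WS_ID_2            PIC X(8).
       01  WS_CODE            PIC X(5).
       01  WS_DESCRIPTION     PIC X(20).
       01  WS_FLAG            PIC X(2).
       01  WS_AMOUNT_TXT      PIC X(8).
       PROCEDURE DIVISION.
       MAIN-PARA.
      * Simulate a short read: 51 bytes of data followed by junk.
           MOVE LOW-VALUES TO INPUT_REC
           MOVE "0005464560,45667759,ZAMTR,!To ACC 12345678,DR,79.85"
               TO INPUT_REC(1:51)
           MOVE 51 TO REC_LEN
      * Copy only the valid bytes; MOVE pads WS_BUFFER with spaces.
           MOVE INPUT_REC(1:REC_LEN) TO WS_BUFFER
           UNSTRING WS_BUFFER DELIMITED BY ","
               INTO WS_ID_1, WS_ID_2, WS_CODE,
                    WS_DESCRIPTION, WS_FLAG, WS_AMOUNT_TXT
           DISPLAY "[" WS_AMOUNT_TXT "]"
           STOP RUN.
```

The brackets in the DISPLAY make the padding visible: the amount field should come out as 79.85 followed by three spaces, with none of the LOW-VALUE bytes that sat beyond REC_LEN.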

EDIT 3: There is no ^@ in the resulting WS_AMOUNT_TXT, but there is still a ^M

Looks like the file system is treating <CR> (that ^M thing) at the end of each record as data.

If the file you are reading came from a Windows platform and you are now reading it on a UNIX platform that would explain the problem. Under Windows records are terminated with <CR><LF> while on UNIX they are terminated with <LF> only. The UNIX file system treats <CR> as if it were part of the record.

If this is the case, you can be pretty sure that there will be a single <CR> at the end of every record read. There are a number of ways to deal with this:

Method 1: As you already noted, pre-edit the file using Notepad++ or some other tool to remove the <CR> characters before processing through your COBOL program. Personally I don't think this is the best way of going about it. I prefer to use a COBOL only solution since it involves fewer processing steps.

Method 2: Trim the last character from each input record before processing it. The last character should always be <CR>. Try the following if you are reading records as variable length and have the actual input record length available.

SUBTRACT 1 FROM REC_LEN
MOVE INPUT_REC(1:REC_LEN) TO WS_BUFFER
UNSTRING WS_BUFFER INTO...

Method 3: Treat <CR> as a delimiter when UNSTRINGing as follows:

UNSTRING INPUT_REC DELIMITED BY "," OR x"0D"
    INTO WS_ID_1, WS_ID_2, WS_CODE, WS_DESCRIPTION, WS_FLAG, WS_AMOUNT_TXT

Method 4: Condition the last receiving field from UNSTRING by replacing trailing characters that are neither digits nor a decimal point with spaces. I outlined this solution a little earlier in this answer. You could also explore the INSPECT statement using the REPLACING option (Format 2). This should be able to do pretty much the same thing: just replace all x"00" by SPACE and x"0D" by SPACE.
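As a rough illustration of the INSPECT variant, the sketch below builds a sample field that mirrors the 79.85^M^@^@ value from the question and then blanks out the unwanted bytes. The program name and the way the sample is constructed are assumptions for demonstration only; in the real program the field would be filled by UNSTRING.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. INSPECT-DEMO.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  WS_AMOUNT_TXT      PIC X(8).
       PROCEDURE DIVISION.
       MAIN-PARA.
      * Build a sample field: "79.85", then <CR>, then two NUL bytes.
           MOVE "79.85" TO WS_AMOUNT_TXT
           MOVE X"0D"   TO WS_AMOUNT_TXT(6:1)
           MOVE X"00"   TO WS_AMOUNT_TXT(7:1)
           MOVE X"00"   TO WS_AMOUNT_TXT(8:1)
      * Replace every NUL and <CR> in the field with a space.
           INSPECT WS_AMOUNT_TXT
               REPLACING ALL X"00" BY SPACE
                         ALL X"0D" BY SPACE
           DISPLAY "[" WS_AMOUNT_TXT "]"
           STOP RUN.
```

Unlike the backward scan, this replaces the two byte values wherever they occur in the field, not just at the end, which is harmless here because neither can appear inside a valid amount.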

Where there is a will, there is a way. Any of the above solutions should work for you. Choose the one you are most comfortable with.