Reading a character string of unknown length

user2053072 · Feb 8, 2013

I have been tasked with writing a Fortran 95 program that will read character input from a file, and then (to start with) simply spit it back out again. The tricky part is that these lines of input are of varying length (no maximum length given) and there can be any number of lines within the file.

I've used

    do
      read( 1, *, iostat = IO ) DNA    ! reads to EOF -- GOOD!!
      if ( IO < 0 ) exit               ! if EOF is reached, exit do
      I = I + 1
      NumRec = I                       ! used later for total no. of records
      allocate( Seq(I) )
      Seq(I) = DNA
      print*, I, Seq(I)
      X = Len_Trim( Seq(I) )           ! length of individual sequence
      print*, 'Sequence size: ', X
      print*
    end do

However, my initial statements list

    character(100), dimension(:), allocatable :: Seq
    character(100)  DNA

and the appropriate integers etc.

I guess what I'm asking is if there is any way to NOT list the size of the character strings in the first instance. Say I've got a string of DNA that is 200+ characters, and then another that is only 25, is there a way that the program can just read what there is and not need to include all the additional blanks? Can this be done without needing to use len_trim, since it can't be referenced in the declaration statements?

Answer

IanH · Feb 8, 2013

To progressively read a record in Fortran 95, use non-advancing input. For example:

    CHARACTER(10) :: buffer
    INTEGER :: size
    READ (unit, "(A)", ADVANCE='NO', SIZE=size, EOR=10, END=20) buffer

will read up to 10 characters' worth (the length of buffer) each time it is executed. The file position will only advance to the next record (the next line) once the entire record has been read by a series of one or more non-advancing reads.

Barring an end of file condition, the size variable will be defined with the actual number of characters read into buffer each time the read statement is executed.

The EOR and END specifiers are used to control execution flow (execution will jump to the appropriately labelled statement) when end of record or end of file conditions occur respectively. You can also use an IOSTAT specifier to detect these conditions, but the particular negative values to use for the two conditions are processor dependent.

You can sum size within a particular record to work out the length of that particular record.

Wrap such a non-advancing read in a loop that detects the end of file and end of record conditions and you have the incremental reading part.
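As a minimal sketch of such a loop (not a definitive implementation), the Fortran 95 program below prints the length of every record in a file. The unit number 10, the file name sequences.txt and the program name are assumptions made for illustration; the 10-character chunk size is arbitrary.

    ! Sketch only: print the length of every record in a file (Fortran 95).
    program record_lengths
      implicit none
      integer, parameter :: lun = 10   ! assumed free unit number
      character(10) :: buffer          ! chunk size is arbitrary
      integer :: nread, length

      ! Hypothetical file name.
      open (lun, file='sequences.txt', status='old', action='read')
      do                               ! one iteration per record
        length = 0
        do                             ! one iteration per chunk
          read (lun, "(A)", advance='NO', size=nread, eor=10, end=20) buffer
          length = length + nread      ! buffer filled; the record may continue
        end do
    10  length = length + nread        ! end of record: count the final partial chunk
        print *, 'Record length: ', length
      end do
    20 close (lun)
    end program record_lengths

When the end of record condition occurs, the SIZE variable still holds the number of characters transferred by that last read, which is why it is added once more at label 10.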

In Fortran 95, the length specification for a local character variable must be a specification expression - essentially an expression that can be safely evaluated prior to the first executable statement of the scope that contains the variable's declaration. Constants represent the simplest case, but a specification expression in a procedure can involve dummy arguments of that procedure, amongst other things.

Reading the entire record of arbitrary length in is then a multi stage process:

  • Determine the length of the current record by using a series of incremental reads. These incremental reads for a particular record finish when the end of record condition occurs, at which time the file position will have moved to the next record.
  • Backspace the file back to the record of interest.
  • Call a procedure, passing the length of the current record as an argument. Inside that procedure, declare a character variable whose length is given by the corresponding dummy argument.
  • Inside that called procedure, read the current record into that character variable using normal advancing input.
  • Carry out further processing on that character variable!
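The steps above can be sketched in Fortran 95 roughly as follows. This is an illustration under stated assumptions rather than a reference implementation: the unit number, the file name sequences.txt and the procedure names record_length and process_record are all hypothetical.

    ! Sketch only: echo records of arbitrary length using Fortran 95 features.
    program echo_records
      implicit none
      integer, parameter :: lun = 10   ! assumed free unit number
      integer :: length
      logical :: at_eof

      ! Hypothetical file name.
      open (lun, file='sequences.txt', status='old', action='read')
      do
        call record_length(lun, length, at_eof)  ! step 1: measure the record
        if (at_eof) exit
        backspace (lun)                          ! step 2: return to its start
        call process_record(lun, length)         ! steps 3 to 5
      end do
      close (lun)

    contains

      ! Count the characters in the next record using non-advancing reads.
      subroutine record_length(unit, length, at_eof)
        integer, intent(in)  :: unit
        integer, intent(out) :: length
        logical, intent(out) :: at_eof
        character(10) :: buffer
        integer :: nread

        length = 0
        at_eof = .false.
        do
          read (unit, "(A)", advance='NO', size=nread, eor=10, end=20) buffer
          length = length + nread
        end do
    10  length = length + nread        ! final partial chunk of the record
        return
    20  at_eof = .true.
      end subroutine record_length

      ! Read the record into a variable of exactly the right length.
      subroutine process_record(unit, length)
        integer, intent(in) :: unit, length
        character(len=length) :: line  ! automatic variable sized by the dummy argument

        read (unit, "(A)") line        ! ordinary advancing read
        print *, length, line
      end subroutine process_record

    end program echo_records

The length of line is a specification expression because it depends only on the dummy argument length, so no fixed maximum such as character(100) is needed.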

Note that each record ends up being read twice: once to determine its length, and a second time to actually read the data into the correctly "lengthed" character variable.

Alternative approaches exist that use allocatable (or automatic) character arrays of length one. The overall strategy is the same. Look at the code of the Get procedures in the common ISO_VARYING_STRING implementation for an example.

Fortran 2003 introduces deferred length character variables, which can have their length specified by an arbitrary expression in an allocate statement or, for allocatable variables, by the length of the right hand side in an assignment statement. This (in conjunction with other "allocatable" enhancements) allows the progressive read that determines the record length to also build the character variable that holds the contents of the record. Your supervisor needs to bring his Fortran environment up to date.
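For comparison, here is a hedged sketch of the single-pass approach that Fortran 2003 makes possible. The procedure name read_line, the unit number, the file name and the 64-character chunk size are assumptions for illustration. It uses the Fortran 2003 intrinsics IS_IOSTAT_END and IS_IOSTAT_EOR, which avoid relying on the processor-dependent IOSTAT values.

    ! Sketch only: read records of arbitrary length in one pass (Fortran 2003).
    program echo_records_f2003
      implicit none
      integer, parameter :: lun = 10          ! assumed free unit number
      character(len=:), allocatable :: line   ! deferred-length character variable
      integer :: ios

      ! Hypothetical file name.
      open (lun, file='sequences.txt', status='old', action='read')
      do
        call read_line(lun, line, ios)
        if (ios /= 0) exit                    ! end of file or error
        print *, len(line), line
      end do
      close (lun)

    contains

      ! Build up one record chunk by chunk; ios is 0 on success.
      subroutine read_line(unit, line, ios)
        integer, intent(in) :: unit
        character(len=:), allocatable, intent(out) :: line
        integer, intent(out) :: ios
        character(len=64) :: buffer           ! chunk size is arbitrary
        integer :: nread

        line = ''
        do
          read (unit, "(A)", advance='NO', size=nread, iostat=ios) buffer
          if (ios > 0) return                 ! error: nread is not usable
          if (is_iostat_end(ios)) return      ! end of file
          line = line // buffer(:nread)       ! assignment reallocates line
          if (is_iostat_eor(ios)) then        ! whole record has been read
            ios = 0
            return
          end if
        end do
      end subroutine read_line

    end program echo_records_f2003

Each record is read only once here, because the same loop that discovers the length also accumulates the contents.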