I wanted to know what the best way to write a large fortran array ( 5000 x 5000 real single precision numbers) to a file. I am trying to save the results of a numerical calculation for later use so they do not need to be repeated. From calculation 5000 x 5000 x 4bytes per number number is 100 Mb, is it possible to save this in a form that is only 100Mb? Is there a way to save fortran arrays as a binary file and read it back in for later use?
I've noticed that saving numbers to a text file produces a file much larger than the size of the data type being saved. Is this because the numbers are being saved as characters?
The only way I am familiar with to write to file is
open (unit=41, file='outfile.txt')
do i=1,len
do j=1,len
write(41,*) Array(i,j)
end do
end do
Although I'd imagine there is a better way to do it. If anyone could point me to some resources or examples to approve my ability to write and read larger files efficiently (in terms of memory) that would be great. Thanks!
Write data files in binary, unless you're going to actually be reading the output - and you're not going to be reading a 2.5 million-element array.
The reasons for using binary are threefold, in decreasing importance:
Accuracy concerns may be the most obvious. When you are converting a (binary) floating point number to a string representation of the decimal number, you are inevitably going to truncate at some point. That's ok if you are sure that when you read the text value back into a floating point value, you are certainly going to get the same value; but that is actually a subtle question and requires choosing your format carefully. Using default formatting, various compilers perform this task with varying degrees of quality. This blog post, written from the point of view of a games programmer, does a good job of covering the issues.
Let's consider a little program which, for a variety of formats, writes a single-precision real number out to a string, and then reads it back in again, keeping track of the maximum error it encounters. We'll just go from 0 to 1, in units of machine epsilon. The code follows:
program testaccuracy
character(len=128) :: teststring
integer, parameter :: nformats=4
character(len=20), parameter :: formats(nformats) = &
[ '( E11.4)', '( E13.6)', '( E15.8)', '(E17.10)' ]
real, dimension(nformats) :: errors
real :: output, back
real, parameter :: delta=epsilon(output)
integer :: i
errors = 0
output = 0
do while (output < 1)
do i=1,nformats
write(teststring,FMT=formats(i)) output
read(teststring,*) back
if (abs(back-output) > errors(i)) errors(i) = abs(back-output)
enddo
output = output + delta
end do
print *, 'Maximum errors: '
print *, formats
print *, errors
print *, 'Trying with default format: '
errors = 0
output = 0
do while (output < 1)
write(teststring,*) output
read(teststring,*) back
if (abs(back-output) > errors(1)) errors(1) = abs(back-output)
output = output + delta
end do
print *, 'Error = ', errors(1)
end program testaccuracy
and when we run it, we get:
$ ./accuracy
Maximum errors:
( E11.4) ( E13.6) ( E15.8) (E17.10)
5.00082970E-05 5.06639481E-07 7.45058060E-09 0.0000000
Trying with default format:
Error = 7.45058060E-09
Note that even using a format with 8 digits after the decimal place - which we might think would be plenty, given that single precision reals are only accurate to 6-7 decimal places - we don't get exact copies back, off by approximately 1e-8. And this compiler's default format does not give us accurate round-trip floating point values; some error is introduced! If you're a video-game programmer, that level of accuracy may well be enough. If you're doing time-dependant simulations of turbulent fluids, however, that might absolutely not be ok, particularly if there's some bias to where the error is introduced, or if the error occurs in what is supposed to be a conserved quantity.
Note that if you try running this code, you'll notice that it takes a surprisingly long time to finish. That's because, maybe surprisingly, performance is another real issue with text output of floating point numbers. Consider the following simple program, which just writes out your example of a 5000 × 5000 real array as text and as unformatted binary:
program testarray
implicit none
integer, parameter :: asize=5000
real, dimension(asize,asize) :: array
integer :: i, j
integer :: time, u
forall (i=1:asize, j=1:asize) array(i,j)=i*asize+j
call tick(time)
open(newunit=u,file='test.txt')
do i=1,asize
write(u,*) (array(i,j), j=1,asize)
enddo
close(u)
print *, 'ASCII: time = ', tock(time)
call tick(time)
open(newunit=u,file='test.dat',form='unformatted')
write(u) array
close(u)
print *, 'Binary: time = ', tock(time)
contains
subroutine tick(t)
integer, intent(OUT) :: t
call system_clock(t)
end subroutine tick
! returns time in seconds from now to time described by t
real function tock(t)
integer, intent(in) :: t
integer :: now, clock_rate
call system_clock(now,clock_rate)
tock = real(now - t)/real(clock_rate)
end function tock
end program testarray
Here are the timing outputs, for writing to disk or to ramdisk:
Disk:
ASCII: time = 41.193001
Binary: time = 0.11700000
Ramdisk
ASCII: time = 40.789001
Binary: time = 5.70000000E-02
Note that when writing to disk, the binary output is 352 times as fast as ASCII, and to ramdisk it's closer to 700 times. There are two reasons for this - one is that you can write out data all at once, rather than having to loop; the other is that generating the string decimal representation of a floating point number is a surprisingly subtle operation which requires a significant amount of computing for each value.
Finally, is data size; the text file in the above example comes out (on my system) to about 4 times the size of the binary file.
Now, there are real problems with binary output. In particular, raw Fortran (or, for that matter, C) binary output is very brittle. If you change platforms, or your data size changes, your output may no longer be any good. Adding new variables to the output will break the file format unless you always add new data at the end of the file, and you have no way of knowing ahead of time what variables are in a binary blob you get from your collaborator (who might be you, three months ago). Most of the downsides of binary output are avoided by using libraries like NetCDF, which write self-describing binary files that are much more "future proof" than raw binary. Better still, since it's a standard, many tools read NetCDF files.
There are many NetCDF tutorials on the internet; ours is here. A simple example using NetCDF gives similar times to raw binary:
$ ./array
ASCII: time = 40.676998
Binary: time = 4.30000015E-02
NetCDF: time = 0.16000000
but gives you a nice self-describing file:
$ ncdump -h test.nc
netcdf test {
dimensions:
X = 5000 ;
Y = 5000 ;
variables:
float Array(Y, X) ;
Array:units = "ergs" ;
}
and file sizes about the same as raw binary:
$ du -sh test.*
96M test.dat
96M test.nc
382M test.txt
the code follows:
program testarray
implicit none
integer, parameter :: asize=5000
real, dimension(asize,asize) :: array
integer :: i, j
integer :: time, u
forall (i=1:asize, j=1:asize) array(i,j)=i*asize+j
call tick(time)
open(newunit=u,file='test.txt')
do i=1,asize
write(u,*) (array(i,j), j=1,asize)
enddo
close(u)
print *, 'ASCII: time = ', tock(time)
call tick(time)
open(newunit=u,file='test.dat',form='unformatted')
write(u) array
close(u)
print *, 'Binary: time = ', tock(time)
call tick(time)
call writenetcdffile(array)
print *, 'NetCDF: time = ', tock(time)
contains
subroutine tick(t)
integer, intent(OUT) :: t
call system_clock(t)
end subroutine tick
! returns time in seconds from now to time described by t
real function tock(t)
integer, intent(in) :: t
integer :: now, clock_rate
call system_clock(now,clock_rate)
tock = real(now - t)/real(clock_rate)
end function tock
subroutine writenetcdffile(array)
use netcdf
implicit none
real, intent(IN), dimension(:,:) :: array
integer :: file_id, xdim_id, ydim_id
integer :: array_id
integer, dimension(2) :: arrdims
character(len=*), parameter :: arrunit = 'ergs'
integer :: i, j
integer :: ierr
i = size(array,1)
j = size(array,2)
! create the file
ierr = nf90_create(path='test.nc', cmode=NF90_CLOBBER, ncid=file_id)
! define the dimensions
ierr = nf90_def_dim(file_id, 'X', i, xdim_id)
ierr = nf90_def_dim(file_id, 'Y', j, ydim_id)
! now that the dimensions are defined, we can define variables on them,...
arrdims = (/ xdim_id, ydim_id /)
ierr = nf90_def_var(file_id, 'Array', NF90_REAL, arrdims, array_id)
! ...and assign units to them as an attribute
ierr = nf90_put_att(file_id, array_id, "units", arrunit)
! done defining
ierr = nf90_enddef(file_id)
! Write out the values
ierr = nf90_put_var(file_id, array_id, array)
! close; done
ierr = nf90_close(file_id)
return
end subroutine writenetcdffile
end program testarray