What to put in a binary data file's header

KeithB · Jan 6, 2009

I have a simulation that reads large binary data files that we create (tens to hundreds of GB). We use binary for speed reasons. These files are system-dependent, converted from text files on each system that we run on, so I'm not concerned about portability. The files are currently many instances of a POD struct, written with fwrite.

I need to change the struct, so I want to add a header that has a file version number in it, which will be incremented any time the struct changes. Since I'm doing this, I want to add some other information as well. I'm thinking of the size of the struct, byte order, and maybe the SVN revision number of the code that created the binary file. Is there anything else that would be useful to add?

Answer

Roddy · Jan 6, 2009

In my experience, second-guessing the data you'll need is invariably wasted time. What's important is to structure your metadata in a way that is extensible. For XML files, that's straightforward, but binary files require a bit more thought.

I tend to store metadata in a structure at the END of the file, not the beginning. This has two advantages:

  • Truncated/unterminated files are easily detected.
  • Metadata footers can often be appended to existing files without impacting their reading code.

The simplest metadata footer I use looks something like this:

#include <stdint.h>

struct MetadataFooter
{
  char creatorVersion[40];
  char creatorApplication[40];
  /* ... or whatever other fields you need */
};

struct FileFooter
{
  int64_t metadataFooterSize;  // = sizeof(MetadataFooter)
  char magicString[10];        // a unique identifier for the format: maybe "MYFILEFMT"
};

After the raw data, the metadata footer and THEN the file footer are written.
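The write order just described could be sketched in C roughly as follows. This is only an illustration under some assumptions: `write_footers` is a hypothetical helper name, the two metadata fields are simply the ones from the example structs, and the stream is assumed to be positioned just past the last data record.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Layouts mirror the example structs; the field choice is illustrative. */
typedef struct {
    char creatorVersion[40];
    char creatorApplication[40];
} MetadataFooter;

typedef struct {
    int64_t metadataFooterSize;  /* = sizeof(MetadataFooter) at write time */
    char magicString[10];        /* "MYFILEFMT" plus the terminating NUL */
} FileFooter;

/* Hypothetical helper: appends the metadata footer and then the file footer
   at the current position of `out`. Returns 0 on success, -1 on write error. */
int write_footers(FILE *out, const char *version, const char *application)
{
    assert(out != NULL && version != NULL && application != NULL);

    MetadataFooter meta;
    memset(&meta, 0, sizeof meta);  /* zero-fill so unused bytes are defined */
    strncpy(meta.creatorVersion, version, sizeof meta.creatorVersion - 1);
    strncpy(meta.creatorApplication, application,
            sizeof meta.creatorApplication - 1);

    FileFooter footer;
    memset(&footer, 0, sizeof footer);
    footer.metadataFooterSize = (int64_t)sizeof meta;
    memcpy(footer.magicString, "MYFILEFMT", sizeof "MYFILEFMT");

    /* Metadata first, file footer last, so the footer is always the final
       sizeof(FileFooter) bytes of a complete file. */
    if (fwrite(&meta, sizeof meta, 1, out) != 1) return -1;
    if (fwrite(&footer, sizeof footer, 1, out) != 1) return -1;
    return 0;
}
```

A caller would write all the data records with fwrite as before and then call something like `write_footers(out, "r1234", "mysim")` just before fclose.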

When reading the file, seek to sizeof(FileFooter) bytes before the end. Read the footer and verify the magicString. Then seek back by metadataFooterSize bytes and read the metadata. If the size stored in the file is smaller than the current sizeof(MetadataFooter), the file was written by an older version, and you can use default values for the missing fields.
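A minimal sketch of that read path in C, under a few assumptions: `read_footers` is a hypothetical helper name, new metadata fields are only ever appended to the end of MetadataFooter, and all-zero bytes are an acceptable default for fields an older file does not contain.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    char creatorVersion[40];
    char creatorApplication[40];
} MetadataFooter;

typedef struct {
    int64_t metadataFooterSize;
    char magicString[10];
} FileFooter;

/* Hypothetical helper. Returns 0 on success and stores the offset where the
   raw data ends in *dataEnd. Returns -1 if the file is truncated, the magic
   string does not match, or an I/O call fails.
   Note: fseek/ftell use long offsets; where long is 32 bits, files over 2 GB
   need fseeko/ftello (POSIX) or _fseeki64/_ftelli64 (MSVC) instead. */
int read_footers(FILE *in, MetadataFooter *meta, long *dataEnd)
{
    assert(in != NULL && meta != NULL && dataEnd != NULL);

    FileFooter footer;
    if (fseek(in, -(long)sizeof footer, SEEK_END) != 0) return -1;
    long footerPos = ftell(in);
    if (footerPos < 0) return -1;
    if (fread(&footer, sizeof footer, 1, in) != 1) return -1;
    if (memcmp(footer.magicString, "MYFILEFMT", sizeof "MYFILEFMT") != 0)
        return -1;  /* truncated file or not our format */

    int64_t stored = footer.metadataFooterSize;
    if (stored < 0 || stored > footerPos) return -1;

    /* Start from defaults; fields past the stored size keep them. */
    memset(meta, 0, sizeof *meta);
    size_t toRead = stored < (int64_t)sizeof *meta ? (size_t)stored
                                                   : sizeof *meta;

    long metaPos = footerPos - (long)stored;
    if (fseek(in, metaPos, SEEK_SET) != 0) return -1;
    if (toRead > 0 && fread(meta, 1, toRead, in) != toRead) return -1;

    *dataEnd = metaPos;
    return 0;
}
```

Because the magic string sits in the last bytes of the file, a file that was cut off mid-write fails the memcmp check, which is how truncation gets detected. The returned `dataEnd` tells the record-reading loop where to stop.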

As KeithB points out, you could even use this technique to store the metadata as an XML string, which gives you fully extensible metadata while keeping the compactness and speed of binary data.
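One way the XML variant might look in C is sketched below. The helper names `write_xml_footer` and `read_xml_footer` are hypothetical, and the sketch assumes metadataFooterSize is reused to hold the byte length of the XML text rather than sizeof(MetadataFooter). Parsing the XML itself is left to whatever library you already use.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    int64_t metadataFooterSize;  /* here: byte length of the XML text */
    char magicString[10];
} FileFooter;

/* Hypothetical helper: appends the XML text, then the fixed-size footer.
   Returns 0 on success, -1 on write error. */
int write_xml_footer(FILE *out, const char *xml)
{
    assert(out != NULL && xml != NULL);
    size_t len = strlen(xml);

    FileFooter footer;
    memset(&footer, 0, sizeof footer);
    footer.metadataFooterSize = (int64_t)len;
    memcpy(footer.magicString, "MYFILEFMT", sizeof "MYFILEFMT");

    if (len > 0 && fwrite(xml, 1, len, out) != len) return -1;
    return fwrite(&footer, sizeof footer, 1, out) == 1 ? 0 : -1;
}

/* Hypothetical helper: returns a malloc'd, NUL-terminated copy of the XML
   text, or NULL on any failure. The caller frees the result. */
char *read_xml_footer(FILE *in)
{
    assert(in != NULL);

    FileFooter footer;
    if (fseek(in, -(long)sizeof footer, SEEK_END) != 0) return NULL;
    long footerPos = ftell(in);
    if (footerPos < 0) return NULL;
    if (fread(&footer, sizeof footer, 1, in) != 1) return NULL;
    if (memcmp(footer.magicString, "MYFILEFMT", sizeof "MYFILEFMT") != 0)
        return NULL;

    int64_t len = footer.metadataFooterSize;
    if (len < 0 || len > footerPos) return NULL;

    char *xml = malloc((size_t)len + 1);
    if (xml == NULL) return NULL;
    if (fseek(in, footerPos - (long)len, SEEK_SET) != 0 ||
        (len > 0 && fread(xml, 1, (size_t)len, in) != (size_t)len)) {
        free(xml);
        return NULL;
    }
    xml[len] = '\0';
    return xml;
}
```

Since the length is stored in the file, the XML can grow or shrink between versions without changing the reading code for the binary records.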