Replace a sequence of bytes in a binary file

Tomas · Jun 29, 2011 · Viewed 21k times

What is the best way to replace a sequence of bytes in a binary file with another sequence of the same length? The files will be fairly large, about 50 MB, and should not be loaded into memory all at once.

Update: I do not know the location of the bytes that need to be replaced; I need to find them first.

Answer

Jon Skeet · Jun 29, 2011

Assuming you're trying to replace a known section of the file:

  • Open a FileStream with read/write access
  • Seek to the right place
  • Overwrite existing data

Sample code:

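// Requires: using System.IO;
public static void ReplaceData(string filename, long position, byte[] data)
{
    // Open the existing file for read/write access without truncating it.
    using (Stream stream = File.Open(filename, FileMode.Open, FileAccess.ReadWrite))
    {
        // Seek to the target offset (long, so offsets beyond 2 GB also work)
        // and overwrite data.Length bytes in place.
        stream.Position = position;
        stream.Write(data, 0, data.Length);
    }
}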

If you're effectively trying to do a binary version of string.Replace (e.g. "always replace bytes { 51, 20, 34 } with { 20, 35, 15 }"), then it's rather harder. As a quick description of what you'd do:

  • Allocate a buffer of at least the size of data you're interested in
  • Repeatedly read into the buffer, scanning for the data
  • If you find a match, seek back to the right place (e.g. stream.Position -= buffer.Length - indexWithinBuffer; if the read filled the whole buffer) and overwrite the data
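
The three steps above might look like the following minimal sketch. ChunkScanner, ReplaceFirstWithinChunks and the 64 KB default buffer size are illustrative names and values, not part of the original answer. It searches each chunk independently, so it will miss a match that straddles two reads.

```csharp
using System;
using System.IO;

public static class ChunkScanner
{
    // Returns the index of the first occurrence of pattern within
    // buffer[0..count), or -1 if there is none.
    public static int IndexOf(byte[] buffer, int count, byte[] pattern)
    {
        for (int i = 0; i <= count - pattern.Length; i++)
        {
            int j = 0;
            while (j < pattern.Length && buffer[i + j] == pattern[j]) j++;
            if (j == pattern.Length) return i;
        }
        return -1;
    }

    // Overwrites the first match that lies wholly inside one chunk.
    // Returns the file offset of that match, or -1 if none was found.
    public static long ReplaceFirstWithinChunks(
        string filename, byte[] pattern, byte[] replacement, int bufferSize = 64 * 1024)
    {
        if (pattern.Length == 0 || pattern.Length != replacement.Length)
            throw new ArgumentException("Pattern and replacement must have the same non-zero length.");

        byte[] buffer = new byte[Math.Max(bufferSize, pattern.Length)];
        using (FileStream stream = File.Open(filename, FileMode.Open, FileAccess.ReadWrite))
        {
            int bytesRead;
            while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                int index = IndexOf(buffer, bytesRead, pattern);
                if (index >= 0)
                {
                    // The stream now sits just past the chunk; step back to the match.
                    // bytesRead is used instead of buffer.Length because the last
                    // chunk of the file is usually shorter than the buffer.
                    stream.Position -= bytesRead - index;
                    long offset = stream.Position;
                    stream.Write(replacement, 0, replacement.Length);
                    return offset;
                }
            }
        }
        return -1;
    }
}
```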

Sounds simple so far... but the tricky bit is if the data starts near the end of the buffer. You need to remember all potential matches and how far you've matched so far, so that if you get a match when you read the next buffer's-worth, you can detect it.
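
One way to "remember how far you've matched so far" is a Knuth-Morris-Pratt failure table. That is a technique swapped in here for illustration, not something this answer prescribes, and StreamingMatcher, Feed and FindFirst are hypothetical names. The sketch keeps the match state between reads, so a match that begins in one buffer and ends in the next is still detected.

```csharp
using System;
using System.IO;

public sealed class StreamingMatcher
{
    private readonly byte[] pattern;
    private readonly int[] failure; // failure[i] = longest proper border of pattern[0..i]
    private int matched;            // number of pattern bytes matched so far

    public StreamingMatcher(byte[] pattern)
    {
        if (pattern.Length == 0) throw new ArgumentException("Pattern must not be empty.");
        this.pattern = pattern;
        failure = new int[pattern.Length];
        for (int i = 1, k = 0; i < pattern.Length; i++)
        {
            while (k > 0 && pattern[i] != pattern[k]) k = failure[k - 1];
            if (pattern[i] == pattern[k]) k++;
            failure[i] = k;
        }
    }

    // Feeds one byte; returns true when that byte completes a match.
    public bool Feed(byte value)
    {
        while (matched > 0 && value != pattern[matched]) matched = failure[matched - 1];
        if (value == pattern[matched]) matched++;
        if (matched < pattern.Length) return false;
        matched = failure[matched - 1]; // carry on so overlapping matches are seen too
        return true;
    }

    // Returns the offset of the first match, relative to where the stream
    // was positioned when the call started, or -1 if there is no match.
    public static long FindFirst(Stream stream, byte[] pattern, int bufferSize = 64 * 1024)
    {
        var matcher = new StreamingMatcher(pattern);
        byte[] buffer = new byte[bufferSize];
        long consumed = 0; // bytes fed to the matcher so far
        int bytesRead;
        while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < bytesRead; i++)
            {
                consumed++;
                if (matcher.Feed(buffer[i])) return consumed - pattern.Length;
            }
        }
        return -1;
    }
}
```

Because the matcher state lives outside the read loop, the buffer size no longer has to relate to the pattern length. The returned offset can then be used to seek and overwrite, as in the first sample.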

There are probably ways of avoiding this trickiness, but I wouldn't like to try to come up with them offhand :)

EDIT: Okay, I've got an idea which might help...

  • Keep a buffer which is at least twice as long as the sequence you're looking for
  • Repeatedly:
    • Copy the second half of the buffer into the first half
    • Fill the second half of the buffer from the file
    • Search throughout the whole buffer for the data you're looking for

That way at some point, if the data is present, it will be completely within the buffer.

You'd need to be careful about where the stream was in order to get back to the right place, but I think this should work. It would be trickier if you were trying to find all matches, but at least the first match should be reasonably simple...
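
A minimal sketch of the two-half buffer idea, handling the first match only. SlidingWindowReplacer, ReplaceFirst and the halfSize parameter are hypothetical names, and the code assumes the pattern and replacement have the same length. It fills and searches before sliding the halves, which amounts to the same loop as described above, and it tracks how many bytes have been read so the match offset in the file can be recovered.

```csharp
using System;
using System.IO;

public static class SlidingWindowReplacer
{
    // Replaces the first occurrence of pattern with replacement, in place.
    // Returns the file offset of the match, or -1 if there was none.
    public static long ReplaceFirst(
        string filename, byte[] pattern, byte[] replacement, int halfSize = 64 * 1024)
    {
        if (pattern.Length == 0 || pattern.Length != replacement.Length)
            throw new ArgumentException("Pattern and replacement must have the same non-zero length.");

        int half = Math.Max(halfSize, pattern.Length);
        byte[] window = new byte[2 * half];
        using (FileStream stream = File.Open(filename, FileMode.Open, FileAccess.ReadWrite))
        {
            int firstLen = 0;   // valid bytes in the first half (0 on the first pass)
            long totalRead = 0; // bytes read from the file so far
            while (true)
            {
                // Fill the second half; loop because Read may return fewer bytes than asked.
                int secondLen = 0;
                while (secondLen < half)
                {
                    int n = stream.Read(window, half + secondLen, half - secondLen);
                    if (n == 0) break; // end of file
                    secondLen += n;
                }
                totalRead += secondLen;

                // Search the carried-over first half together with the new data.
                int start = half - firstLen;
                int index = IndexOf(window, start, firstLen + secondLen, pattern);
                if (index >= 0)
                {
                    // File offset of window[start], plus the match's distance from it.
                    long offset = totalRead - secondLen - firstLen + (index - start);
                    stream.Position = offset;
                    stream.Write(replacement, 0, replacement.Length);
                    return offset;
                }
                if (secondLen < half) return -1; // reached the end of the file

                // Slide: the second half becomes the first half for the next pass.
                Buffer.BlockCopy(window, half, window, 0, half);
                firstLen = half;
            }
        }
    }

    // Finds pattern within buffer[start..start + count); returns its index or -1.
    private static int IndexOf(byte[] buffer, int start, int count, byte[] pattern)
    {
        int last = start + count - pattern.Length;
        for (int i = start; i <= last; i++)
        {
            int j = 0;
            while (j < pattern.Length && buffer[i + j] == pattern[j]) j++;
            if (j == pattern.Length) return i;
        }
        return -1;
    }
}
```

To handle every match, one option would be to keep scanning after the write and restore stream.Position to totalRead before the next read. That extension is not shown here.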