Node: reading a file in a specified chunk size

kjs3 · Aug 4, 2014 · Viewed 15.9k times

The goal: Upload large files to AWS Glacier without holding the whole file in memory.

I'm currently uploading to Glacier using fs.readFileSync() and things are working. But I need to handle files larger than 4GB, and I'd like to upload multiple chunks in parallel, which means moving to multipart uploads. I can choose the chunk size, but Glacier requires every chunk to be the same size (except the last).
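For context, the multipart flow I'm targeting looks roughly like this (a sketch assuming the aws-sdk v2 Glacier client, which computes the required tree-hash checksums itself; the region, vault name, and 1MB part size are placeholders, and Glacier only accepts part sizes that are a power-of-two multiple of 1MB):

var AWS = require('aws-sdk');
var glacier = new AWS.Glacier({ region: 'us-east-1' }); // placeholder region
var PART_SIZE = 1024 * 1024; // 1MB; must be 1MB * 2^n, up to 4GB

glacier.initiateMultipartUpload({
  accountId: '-',             // '-' means the credentials' own account
  vaultName: 'my-vault',      // placeholder
  partSize: String(PART_SIZE)
}, function(err, res) {
  if (err) throw err;
  // Every part except the last must be exactly PART_SIZE bytes and is
  // addressed by its byte range within the archive, e.g.:
  // glacier.uploadMultipartPart({
  //   accountId: '-', vaultName: 'my-vault', uploadId: res.uploadId,
  //   range: 'bytes 0-' + (PART_SIZE - 1) + '/*', body: chunk
  // }, callback);
});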

This thread suggests that I can set a chunk size on a read stream, but I'm not actually guaranteed to get chunks of that size.

Any info on how I can get consistent parts without reading the whole file into memory and splitting it up manually?
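One way to get uniform parts from a stream (a sketch, not a library API; FixedChunker is a made-up helper) is a Transform that buffers input and only emits full CHUNK_SIZE pieces, flushing the shorter remainder at the end:

var stream = require('stream');

function FixedChunker(chunkSize) {
  var t = new stream.Transform();
  var pending = [];
  var pendingLength = 0;
  t._transform = function(data, enc, cb) {
    pending.push(data);
    pendingLength += data.length;
    // Emit as many full-size chunks as we have buffered.
    while (pendingLength >= chunkSize) {
      var all = Buffer.concat(pending, pendingLength);
      this.push(all.slice(0, chunkSize));
      pending = [all.slice(chunkSize)];
      pendingLength = pending[0].length;
    }
    cb();
  };
  t._flush = function(cb) {
    // The final part is allowed to be smaller.
    if (pendingLength > 0)
      this.push(Buffer.concat(pending, pendingLength));
    cb();
  };
  return t;
}

// usage: fs.createReadStream(filePath).pipe(FixedChunker(CHUNK_SIZE))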

Assuming I can get to that point, I was planning to use cluster with a few processes pulling parts off the stream as fast as they can upload them to AWS. If that seems like the wrong way to parallelize the work, I'd love suggestions there.
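Because fs.read() takes an explicit position, an alternative to having workers share one stream is offset-based parallelism: each process opens the file itself and reads only the part indices it owns, so no stream state has to be coordinated. A sketch (readPart is a hypothetical helper):

var fs = require('fs');

function readPart(filePath, partIndex, partSize, cb) {
  fs.open(filePath, 'r', function(err, fd) {
    if (err) return cb(err);
    var buf = Buffer.alloc(partSize);
    var position = partIndex * partSize; // byte offset of this part
    fs.read(fd, buf, 0, partSize, position, function(err, nread) {
      fs.close(fd, function() {});
      if (err) return cb(err);
      cb(null, buf.slice(0, nread)); // the last part may be short
    });
  });
}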

Answer

mscdex · Aug 4, 2014

If nothing else, you can just use fs.open(), fs.read(), and fs.close() manually. Example:

var fs = require('fs');

var CHUNK_SIZE = 10 * 1024 * 1024, // 10MB
    buffer = Buffer.alloc(CHUNK_SIZE),
    filePath = '/tmp/foo';

fs.open(filePath, 'r', function(err, fd) {
  if (err) throw err;
  function readNextChunk() {
    // A `null` position reads from the current file offset, so successive
    // calls walk the file forward in CHUNK_SIZE steps.
    fs.read(fd, buffer, 0, CHUNK_SIZE, null, function(err, nread) {
      if (err) throw err;

      if (nread === 0) {
        // done reading file, do any necessary finalization steps

        fs.close(fd, function(err) {
          if (err) throw err;
        });
        return;
      }

      var data;
      if (nread < CHUNK_SIZE)
        data = buffer.slice(0, nread); // the last chunk may be shorter
      else
        data = buffer;

      // do something with `data`, then call `readNextChunk();`
    });
  }
  readNextChunk();
});
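One caveat when wiring this into an uploader: `buffer` is reused for every read, so if the upload is asynchronous the chunk should be copied first. A sketch of the "do something with `data`" step, where uploadToGlacier is a hypothetical wrapper around the SDK call:

function handleChunk(data) {
  var chunk = Buffer.from(data);         // own copy; `buffer` is reused
  uploadToGlacier(chunk, function(err) { // hypothetical upload wrapper
    if (err) throw err;
    readNextChunk();                     // pull the next fixed-size part
  });
}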