createReadStream in Node.js

user3421904 · Jun 2, 2015 · Viewed 20.9k times

So I used fs.readFile() and it gives me

"FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory"

Since fs.readFile() loads the whole file into memory before calling the callback, should I use fs.createReadStream() instead?

That's what I was doing previously with readFile:

var fs = require('fs');

fs.readFile('myfile.json', function (err1, data) {
    if (err1) {
        console.error(err1);
    } else {
        var myData = JSON.parse(data);
        // Do some operation on myData here
    }
});

Sorry, I'm kind of new to streaming; is the following the right way to do the same thing but with streaming?

var readStream = fs.createReadStream('myfile.json');

readStream.on('end', function () {  
    readStream.close();
    var myData = JSON.parse(readStream);
    //Do some operation on myData here
});

Thanks

Answer

Chev · Jun 2, 2015

If the file is enormous then yes, streaming is how you'll want to deal with it. However, your second example won't work as written: JSON.parse(readStream) would try to parse the stream object itself, not the file's contents. And even if you collected every chunk the stream emits and parsed them all on 'end', you'd still be buffering the whole file into memory, which is essentially no different from readFile.
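To make that concrete, the "collect everything and parse on end" version would look something like the sketch below, and it has the same memory footprint as fs.readFile():

var fs = require('fs');

var readStream = fs.createReadStream('myfile.json');
var chunks = [];

readStream.on('data', function (chunk) {
  chunks.push(chunk); // every chunk stays in memory until the end
});

readStream.on('end', function () {
  // At this point the entire file is in RAM, just like with fs.readFile().
  var myData = JSON.parse(Buffer.concat(chunks).toString());
  // Do some operation on myData here
});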

You'll want to check out JSONStream. What streaming means is that you want to deal with the data as it flows by. In your case you obviously have to do this because you cannot buffer the entire file into memory all at once. With that in mind, hopefully code like this makes sense:

JSONStream.parse('rows.*.doc')

Notice that it has a kind of query pattern. That's because you will not have the entire JSON object/array from the file to work with all at once, so you have to think more in terms of how you want JSONStream to deal with the data as it finds it.
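For example, a selector like 'rows.*.doc' would match each doc object in a file shaped roughly like this (the shape here is only an assumption for illustration; adjust the selector to whatever your actual JSON looks like):

{
  "total_rows": 3,
  "rows": [
    { "id": "a", "doc": { "name": "first" } },
    { "id": "b", "doc": { "name": "second" } },
    { "id": "c", "doc": { "name": "third" } }
  ]
}

As the file streams through the parser, each matching doc object is emitted on its own, one at a time.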

You can use JSONStream to essentially query for the JSON data that you are interested in. This way you're never buffering the whole file into memory. It does have the downside that if you do need all the data, then you'll have to stream the file multiple times, using JSONStream to pull out only the data you need right at that moment, but in your case you don't have much choice.

You could also use JSONStream to parse out data in order and do something like dump it into a database.

JSONStream.parse is similar to JSON.parse but instead of returning a whole object it returns a stream. When the parse stream gets enough data to form a whole object matching your query, it will emit a data event with the data being the document that matches your query. Once you've configured your data handler you can pipe your read stream into the parse stream and watch the magic happen.

Example:

var fs = require('fs');
var JSONStream = require('JSONStream');
var readStream = fs.createReadStream('myfile.json');
var parseStream = JSONStream.parse('rows.*.doc');
parseStream.on('data', function (doc) {
  db.insert(doc); // pseudo-code for inserting doc into a pretend database.
});
readStream.pipe(parseStream);

That's the verbose way to help you understand what's happening. Here is a more succinct way:

var fs = require('fs');
var JSONStream = require('JSONStream');
fs.createReadStream('myfile.json')
  .pipe(JSONStream.parse('rows.*.doc'))
  .on('data', function (doc) {
    db.insert(doc);
  });
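As mentioned above, if you later need a different slice of the data you simply stream the file again with a different selector. A rough sketch, where both selectors are assumptions about the file's shape:

var fs = require('fs');
var JSONStream = require('JSONStream');

// First pass: pull out each doc.
fs.createReadStream('myfile.json')
  .pipe(JSONStream.parse('rows.*.doc'))
  .on('data', function (doc) {
    // do something with each doc
  })
  .on('end', function () {
    // Second pass: stream the same file again for a different piece of data.
    fs.createReadStream('myfile.json')
      .pipe(JSONStream.parse('rows.*.id')) // hypothetical second selector
      .on('data', function (id) {
        // do something with each id
      });
  });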

Edit:

For further clarity about what's going on, try to think about it like this. Let's say you have a giant lake and you want to treat the water to purify it and move the water to a new reservoir. If you had a giant magical helicopter with a huge bucket then you could fly over the lake, put the lake in the bucket, add treatment chemicals to it, then fly it to its destination.

The problem of course being that there is no such helicopter that can deal with that much weight or volume. It's simply impossible, but that doesn't mean we can't accomplish our goal a different way. So instead you build a series of rivers (streams) between the lake and the new reservoir. You then set up cleansing stations in these rivers that purify any water that passes through them. These stations could operate in a variety of ways. Maybe the treatment can be done so fast that you can let the river flow freely and the purification will just happen as the water travels down the stream at maximum speed.

It's also possible that it takes some time for the water to be treated, or that the station needs a certain amount of water before it can effectively treat it. So you design your rivers to have gates and you control the flow of the water from the lake into your rivers, letting the stations buffer just the water they need until they've performed their job and released the purified water downstream and on to its final destination.

That's almost exactly what you want to do with your data. The parse stream is your cleansing station and it buffers data until it has enough to form a whole document that matches your query, then it pushes just that data downstream (and emits the data event).

Node streams are nice because most of the time you don't have to deal with opening and closing the gates yourself. Node streams are smart enough to apply backpressure: once a stream has buffered a certain amount of data, it tells the source to slow down. It's as if the cleansing station and the gates on the lake are talking to each other to work out the perfect flow rate.
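To make the "gates" concrete, here is roughly what pipe() does for you behind the scenes. This sketch just copies the file to a second, made-up file, but the pause/resume pattern is the same for any readable/writable pair:

var fs = require('fs');

var source = fs.createReadStream('myfile.json');
var dest = fs.createWriteStream('copy-of-myfile.json'); // hypothetical output file

source.on('data', function (chunk) {
  var ok = dest.write(chunk);
  if (!ok) {
    // Gate closed: the destination's buffer is full.
    source.pause();
    dest.once('drain', function () {
      // Gate open again: the destination has caught up.
      source.resume();
    });
  }
});

source.on('end', function () {
  dest.end();
});

In practice you'd just write source.pipe(dest) and let Node handle all of that for you.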

If you had a streaming database driver then you'd theoretically be able to create some kind of insert stream and then do parseStream.pipe(insertStream) instead of handling the data event manually :D. Here's an example that uses the same idea to write a filtered version of your JSON file out to another file.

var fs = require('fs');
var JSONStream = require('JSONStream');

fs.createReadStream('myfile.json')
  .pipe(JSONStream.parse('rows.*.doc'))
  .pipe(JSONStream.stringify())
  .pipe(fs.createWriteStream('filtered-myfile.json'));
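And if you did want to build that insert stream yourself, a minimal sketch might look like the following. The object-mode Writable wrapper is an assumption here, as is the callback-style db.insert(doc, callback) signature of the pretend database from earlier:

var fs = require('fs');
var JSONStream = require('JSONStream');
var Writable = require('stream').Writable;

// Hypothetical insert stream: wraps the pretend db.insert() in an
// object-mode Writable so parsed docs can be piped straight into it.
var insertStream = new Writable({
  objectMode: true,
  write: function (doc, encoding, callback) {
    db.insert(doc, callback); // pseudo-code, as before
  }
});

fs.createReadStream('myfile.json')
  .pipe(JSONStream.parse('rows.*.doc'))
  .pipe(insertStream);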