I'm working with some multi-gigabyte text files and want to do some stream processing on them using PowerShell. It's simple stuff, just parsing each line and pulling out some data, then storing it in a database.
Unfortunately, get-content | %{ whatever($_) }
appears to keep the entire set of lines in memory at this stage of the pipe. It's also surprisingly slow, taking a very long time to actually read it all in.
So my question is two parts: how can I make it process the stream a line at a time instead of buffering the whole thing in memory, and how can I make it faster? get-content
appears to be 100x slower than a C# script. I'm hoping there's something dumb I'm doing here, like missing a -LineBufferSize
parameter or something...
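To make it concrete, here is a simplified stand-in for what I'm doing; the file name and the regex are placeholders for my real parsing, and the database insert is left out:

# simplified stand-in: parse each line with a regex and emit one field
# (the real script stores the extracted data in a database)
Get-Content '.\huge.log' | ForEach-Object {
    if ($_ -match '^(\S+)\s') {
        $Matches[1]
    }
}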
If you are really about to work on multi-gigabyte text files, then do not use PowerShell. Even if you find a way to read it faster, processing a huge number of lines will be slow in PowerShell anyway, and you cannot avoid this. Even simple loops are expensive; say, for 10 million iterations (quite realistic in your case) we have:
# "empty" loop: takes 10 seconds
measure-command { for($i=0; $i -lt 10000000; ++$i) {} }
# "simple" job, just output: takes 20 seconds
measure-command { for($i=0; $i -lt 10000000; ++$i) { $i } }
# "more real job": 107 seconds
measure-command { for($i=0; $i -lt 10000000; ++$i) { $i.ToString() -match '1' } }
UPDATE: If you are still not scared, then try to use the .NET reader:
$reader = [System.IO.File]::OpenText("my.log")
try {
    for() {
        # ReadLine() returns $null at the end of the file
        $line = $reader.ReadLine()
        if ($line -eq $null) { break }
        # process the line
        $line
    }
}
finally {
    $reader.Close()
}
UPDATE 2
There are comments about possibly better / shorter code. There is nothing wrong with the original code with for,
and it is not pseudo-code. But a shorter (the shortest?) variant of the reading loop is:
$reader = [System.IO.File]::OpenText("my.log")
while($null -ne ($line = $reader.ReadLine())) {
    $line
}
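If you still want the try/finally and a pipeline-friendly shape, one option is to wrap the loop in a function. This is just a sketch (the function name is arbitrary), and piping the lines reintroduces some per-line overhead, as discussed above:

function Read-FileLines([string]$Path) {
    $reader = [System.IO.File]::OpenText($Path)
    try {
        while($null -ne ($line = $reader.ReadLine())) {
            # lines are emitted to the pipeline one at a time
            $line
        }
    }
    finally {
        $reader.Close()
    }
}

# usage: streams the file without buffering it all in memory
Read-FileLines "my.log" | ForEach-Object { <# process $_ here #> }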