How to process a file in PowerShell line by line as a stream

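I'm working with some multi-gigabyte text files and want to do some stream processing on them using PowerShell. It's simple stuff: just parse each line, pull out some data, and store it in a database.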

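Unfortunately, get-content | %{ whatever($_) } appears to keep the entire set of lines at this stage of the pipeline in memory. It is also surprisingly slow, taking a very long time to actually read the whole file.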

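So my question has two parts:

  1. How can I get it to process the stream line by line and not buffer the whole thing in memory? I would like to avoid using several GB of RAM for this purpose.
  2. How can I make it run faster? PowerShell iterating over get-content appears to be 100x slower than a C# script.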

I'm hoping there's something dumb I'm doing here, like missing a -LineBufferSize parameter or something.


If you are really going to work on multi-gigabyte text files, then do not use PowerShell. Even if you find a way to read the file faster, processing a huge number of lines will still be slow in PowerShell, and you cannot avoid this. Even simple loops are expensive. For example, for 10 million iterations (quite realistic in your case) we have:

# "empty" loop: takes 10 seconds
measure-command { for($i=0; $i -lt 10000000; ++$i) {} }


# "simple" job, just output: takes 20 seconds
measure-command { for($i=0; $i -lt 10000000; ++$i) { $i } }


# "more real job": 107 seconds
measure-command { for($i=0; $i -lt 10000000; ++$i) { $i.ToString() -match '1' } }

UPDATE: If you are still not scared then try to use the .NET reader:

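$reader = [System.IO.File]::OpenText("my.log")
try {
    for() {
        $line = $reader.ReadLine()
        # ReadLine returns $null at end of file
        if ($null -eq $line) { break }
        # process the line
        $line
    }
}
finally {
    # always release the file handle, even if processing throws
    $reader.Close()
}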

UPDATE 2

There are comments about possibly better or shorter code. There is nothing wrong with the original code that uses for, and it is not pseudo-code. But the shorter (shortest?) variant of the reading loop is:

$reader = [System.IO.File]::OpenText("my.log")
while($null -ne ($line = $reader.ReadLine())) {
    $line
}
# release the file handle once the loop is done
$reader.Close()
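
To connect the reader loop back to the task in the question (parse each line, pull out some data), here is a minimal sketch. The function name Get-ErrorField, the sample log format, and the 'ERROR' filter are assumptions made up for illustration; they are not from the question. It also converts the path to a full path first, because .NET resolves relative paths against the process working directory, which may differ from the PowerShell location.

```powershell
# Hypothetical helper (not from the original answer): streams a file and
# emits the first whitespace-delimited field of every line containing 'ERROR'.
function Get-ErrorField {
    param([Parameter(Mandatory)][string]$Path)

    # .NET does not follow PowerShell's current location, so use a full path.
    $fullPath = Convert-Path -LiteralPath $Path
    $reader = [System.IO.File]::OpenText($fullPath)
    try {
        while ($null -ne ($line = $reader.ReadLine())) {
            if ($line.Contains('ERROR')) {
                # Emit only the extracted value; the line itself is not kept.
                ($line -split '\s+', 2)[0]
            }
        }
    }
    finally {
        # Release the file handle even if processing fails part-way.
        $reader.Dispose()
    }
}

# Usage with a small throwaway file (assumed sample data).
$sample = [System.IO.Path]::GetTempFileName()
Set-Content -LiteralPath $sample -Value '10:00 INFO start', '10:01 ERROR disk full', '10:02 ERROR timeout'
$fields = @(Get-ErrorField -Path $sample)
Remove-Item -LiteralPath $sample
$fields
```

Only the current line and the extracted values are held at any time, so memory use should not grow with the size of the file unless the caller collects all the output.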

System.IO.File.ReadLines() is perfect for this scenario. It returns all the lines of a file, but lets you begin iterating over the lines immediately which means it does not have to store the entire contents in memory.

Requires .NET 4.0 or higher.

foreach ($line in [System.IO.File]::ReadLines($filename)) {
    # do something with $line
}

http://msdn.microsoft.com/en-us/library/dd383503.aspx
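
As a rough illustration of the ReadLines approach, the sketch below keeps only running aggregates (a line count and the longest line seen so far) instead of the lines themselves. The sample file contents and the variable names are assumptions for the example. The path is converted to a full path first, since .NET methods resolve relative paths against the process working directory and not the PowerShell location.

```powershell
# Throwaway sample file so the sketch runs on its own (assumed data).
$filename = [System.IO.Path]::GetTempFileName()
Set-Content -LiteralPath $filename -Value 'alpha', 'beta', 'gamma', 'delta'

# Resolve to a full path before handing it to .NET.
$fullPath = Convert-Path -LiteralPath $filename

$lineCount = 0
$longest = ''
foreach ($line in [System.IO.File]::ReadLines($fullPath)) {
    # Only the current line and the two aggregates are held in memory.
    $lineCount++
    if ($line.Length -gt $longest.Length) { $longest = $line }
}

Remove-Item -LiteralPath $filename
"$lineCount lines, longest: $longest"
```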

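If you want to use plain PowerShell, check out the code below. Note that it reads the whole file into $content before the loop starts, so it only suits files that fit comfortably in memory.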

$content = Get-Content C:\Users\You\Documents\test.txt
foreach ($line in $content)
{
    Write-Host $line
}
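
A possible streaming variation that stays within PowerShell: the sketch below avoids storing the lines in a variable and handles each one as it is read, once with switch -File and once with a Get-Content pipeline. The sample file, the id=N line format, and the variable names are assumptions made up for illustration.

```powershell
# Throwaway sample file so the sketch runs on its own (assumed data).
$path = [System.IO.Path]::GetTempFileName()
Set-Content -LiteralPath $path -Value 'id=1', 'skip me', 'id=22'

# switch -File reads the file line by line; -Regex tests each line
# against the pattern and fills $Matches for the action block.
$ids = @(
    switch -Regex -File $path {
        '^id=(\d+)$' { [int]$Matches[1] }
    }
)

# The same extraction with the pipeline: Get-Content passes lines to
# ForEach-Object as they are read, so no array of lines is built here.
$total = 0
Get-Content -LiteralPath $path | ForEach-Object {
    if ($_ -match '^id=(\d+)$') { $total += [int]$Matches[1] }
}

Remove-Item -LiteralPath $path
"ids: $($ids -join ', '); total: $total"
```

Neither form is likely to match compiled code for speed, but both keep memory use flat as long as the results themselves stay small.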