Reading large text files with streams in C#

I've been handed the lovely task of working out how to handle large files being loaded into our application's script editor (it's like VBA for our internal product, used for quick macros). Most files are around 300-400 KB, which load fine. But once they get beyond 100 MB the process has a hard time (as you'd expect).

What happens is that the file is read and pushed into a RichTextBox, which is then navigated - don't worry too much about that part.

The developer who wrote the initial code simply used a StreamReader and did

[Reader].ReadToEnd()

which can take quite a while to complete.

My task is to break this code up, read the file into a buffer in chunks, and show a progress bar with an option to cancel it.

Some assumptions:

  • Most files will be 30-40 MB.
  • The contents of the files are text (not binary); some are in Unix format, some in DOS format.
  • Once the contents are retrieved, we work out which line terminator is used.
  • No one is concerned about how long rendering in the RichTextBox takes once it's loaded; this is only about the initial load of the text.

Now for the questions:

  • Can I simply use a StreamReader, check the Length property (to use as ProgressMax), issue a Read with a set buffer size, and iterate in a while loop inside a BackgroundWorker so it doesn't block the main UI thread, then return the StringBuilder to the main thread once it has completed?
  • The contents will go into a StringBuilder. Can I initialise the StringBuilder with the size of the stream if the length is available?

Are these (in your professional opinions) good ideas? I've had a few issues in the past reading content from streams, because it would always miss the last few bytes or something, but I'll ask another question if that turns out to be the case.
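For reference, this is roughly the shape of what I have in mind; it's an untested sketch with placeholder names (a BackgroundWorker with WorkerReportsProgress and WorkerSupportsCancellation enabled), not working code:

private void backgroundWorker1_DoWork(object sender, DoWorkEventArgs e)
{
    var worker = (BackgroundWorker)sender;
    string path = (string)e.Argument;

    using (var reader = new StreamReader(path))
    {
        long length = reader.BaseStream.Length;   // would become ProgressMax
        var buffer = new char[8192];
        var sb = new StringBuilder();
        long totalRead = 0;
        int read;

        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            if (worker.CancellationPending) { e.Cancel = true; return; }

            sb.Append(buffer, 0, read);
            totalRead += read;   // chars, not bytes, so the percentage is approximate
            worker.ReportProgress((int)(totalRead * 100 / length));
        }

        e.Result = sb.ToString();   // picked up on the UI thread in RunWorkerCompleted
    }
}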


Use a background worker and read only a limited number of lines. Read more only when the user scrolls.

And try to never use ReadToEnd(). It's one of those functions that makes you wonder why it exists; it's a script kiddie's helper that works fine for small things, but as you've seen, it falls over on large files...
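A minimal sketch of that idea, assuming a hypothetical helper that loads only the first maxLines lines and is called again (or resumed from a remembered position) when the user scrolls near the bottom of what has been loaded:

static string ReadHead(string path, int maxLines)
{
    var sb = new StringBuilder();
    using (var reader = new StreamReader(path))
    {
        string line;
        int count = 0;
        // Stop as soon as we have enough lines for the visible part of the editor.
        while (count < maxLines && (line = reader.ReadLine()) != null)
        {
            sb.AppendLine(line);
            count++;
        }
    }
    return sb.ToString();
}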

Those guys telling you to use StringBuilder need to read the MSDN more often:

Performance Considerations
The Concat and AppendFormat methods both concatenate new data to an existing String or StringBuilder object. A String object concatenation operation always creates a new object from the existing string and the new data. A StringBuilder object maintains a buffer to accommodate the concatenation of new data. New data is appended to the end of the buffer if room is available; otherwise, a new, larger buffer is allocated, data from the original buffer is copied to the new buffer, then the new data is appended to the new buffer. The performance of a concatenation operation for a String or StringBuilder object depends on how often a memory allocation occurs.
A String concatenation operation always allocates memory, whereas a StringBuilder concatenation operation only allocates memory if the StringBuilder object buffer is too small to accommodate the new data. Consequently, the String class is preferable for a concatenation operation if a fixed number of String objects are concatenated. In that case, the individual concatenation operations might even be combined into a single operation by the compiler. A StringBuilder object is preferable for a concatenation operation if an arbitrary number of strings are concatenated; for example, if a loop concatenates a random number of strings of user input.

That means a huge memory allocation, which can push the system into heavy use of the swap file (sections of your hard disk drive made to act like RAM), and a hard disk drive is very slow.

The StringBuilder option looks fine when the system has a single user, but when you have two or more users reading large files at the same time, you have a problem.
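That said, if you stay with a StringBuilder, pre-sizing it from the stream length (question 2 in the post) avoids most of those intermediate reallocations. A rough sketch; the capacity is only an upper-bound guess because Length is in bytes and capacity is in chars:

using (var fs = File.OpenRead(path))
using (var reader = new StreamReader(fs))
{
    // May over-allocate slightly for multi-byte encodings, but the buffer
    // will not have to grow and re-copy itself while appending.
    var sb = new StringBuilder((int)Math.Min(fs.Length, int.MaxValue));
    // ... append chunks into sb as they are read ...
}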

You might be better off using memory-mapped file handling here. Memory-mapped file support will be in .NET 4 (I think... I heard about it through someone else talking about it), hence this wrapper, which uses P/Invoke to do the same job.

Edit: See here on MSDN for how it works, and here's the blog entry indicating how it will be done in the upcoming .NET 4 when it is released. The link I gave earlier is a wrapper around the P/Invoke calls to achieve this. You can map the entire file into memory and view it like a sliding window when scrolling through the file.
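For illustration, with the .NET 4 System.IO.MemoryMappedFiles API the sliding-window idea looks roughly like this (the offset and window size are made up; you would derive them from the scroll position and keep the window inside the file length):

using System.IO;
using System.IO.MemoryMappedFiles;

using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
{
    long offset = 0;            // wherever the user has scrolled to
    long windowSize = 1 << 20;  // e.g. a 1 MB window; must not run past the end of the file

    using (var view = mmf.CreateViewStream(offset, windowSize, MemoryMappedFileAccess.Read))
    using (var reader = new StreamReader(view))
    {
        string visibleText = reader.ReadToEnd();  // only this window is materialised as a string
    }
}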

Have a look at the following code snippet. You mentioned "Most files will be 30-40 MB". This claims to read 180 MB in 1.4 seconds on an Intel Quad Core:

private int _bufferSize = 16384;

private void ReadFile(string filename)
{
    StringBuilder stringBuilder = new StringBuilder();

    using (FileStream fileStream = new FileStream(filename, FileMode.Open, FileAccess.Read))
    using (StreamReader streamReader = new StreamReader(fileStream))
    {
        char[] fileContents = new char[_bufferSize];
        int charsRead = streamReader.Read(fileContents, 0, _bufferSize);

        // Can't do much with 0 bytes
        if (charsRead == 0)
            throw new Exception("File is 0 bytes");

        while (charsRead > 0)
        {
            // Append only the characters actually read, so the final (partial)
            // buffer doesn't drag in leftovers from the previous chunk.
            stringBuilder.Append(fileContents, 0, charsRead);
            charsRead = streamReader.Read(fileContents, 0, _bufferSize);
        }
    }
}

Original Article

This should be enough to get you started.

using System;
using System.IO;
using System.Text;

class Program
{
    static void Main(String[] args)
    {
        const int bufferSize = 1024;

        var sb = new StringBuilder();
        var buffer = new Char[bufferSize];
        var length = 0L;
        var totalRead = 0L;
        var count = bufferSize;

        using (var sr = new StreamReader(@"C:\Temp\file.txt"))
        {
            // length and totalRead are what you would use for ProgressMax
            // and the current progress value respectively.
            length = sr.BaseStream.Length;
            while (count > 0)
            {
                count = sr.Read(buffer, 0, bufferSize);
                sb.Append(buffer, 0, count);
                totalRead += count;
            }
        }

        Console.ReadKey();
    }
}

You say you have been asked to show a progress bar while a large file is loading. Is that because the users genuinely want to see the exact percentage of the file that has loaded, or just because they want visual feedback that something is happening?

If the latter is true, then the solution becomes much simpler. Just do reader.ReadToEnd() on a background thread, and display a marquee-type progress bar instead of a proper one.

I raise this point because in my experience this is often the case. When you are writing a data processing program, then users will definitely be interested in a % complete figure, but for simple-but-slow UI updates, they are more likely to just want to know that the computer hasn't crashed. :-)
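In WinForms terms the "simple feedback" version is only a few lines. A sketch using a BackgroundWorker, with hypothetical control names:

progressBar1.Style = ProgressBarStyle.Marquee;   // animates without needing a percentage

var worker = new BackgroundWorker();
worker.DoWork += (s, e) =>
{
    // Runs on a thread-pool thread, so the UI stays responsive.
    using (var reader = new StreamReader((string)e.Argument))
        e.Result = reader.ReadToEnd();
};
worker.RunWorkerCompleted += (s, e) =>
{
    // Back on the UI thread.
    richTextBox1.Text = (string)e.Result;
    progressBar1.Style = ProgressBarStyle.Blocks;
};
worker.RunWorkerAsync(filename);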

An iterator might be perfect for this type of work:

public static IEnumerable<int> LoadFileWithProgress(string filename, StringBuilder stringData)
{
    const int charBufferSize = 4096;
    using (FileStream fs = File.OpenRead(filename))
    {
        using (BinaryReader br = new BinaryReader(fs))
        {
            long length = fs.Length;
            int numberOfChunks = Convert.ToInt32((length / charBufferSize)) + 1;
            double iter = 100 / Convert.ToDouble(numberOfChunks);
            double currentIter = 0;
            yield return Convert.ToInt32(currentIter);
            while (true)
            {
                char[] buffer = br.ReadChars(charBufferSize);
                if (buffer.Length == 0) break;
                stringData.Append(buffer);
                currentIter += iter;
                yield return Convert.ToInt32(currentIter);
            }
        }
    }
}

You can call it using the following:

string filename = "C:\\myfile.txt";
StringBuilder sb = new StringBuilder();
foreach (int progress in LoadFileWithProgress(filename, sb))
{
    // Update your progress counter here!
}
string fileData = sb.ToString();

As the file is loaded, the iterator will return the progress number from 0 to 100, which you can use to update your progress bar. Once the loop has finished, the StringBuilder will contain the contents of the text file.

Also, because you want text, we can just use BinaryReader to read in characters, which will ensure that your buffers line up correctly when reading any multi-byte characters (UTF-8, UTF-16, etc.).

This is all done without using background tasks, threads, or complex custom state machines.

You can improve read speed by using a BufferedStream, like this:

using (FileStream fs = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (BufferedStream bs = new BufferedStream(fs))
using (StreamReader sr = new StreamReader(bs))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        // Process the line here.
    }
}

March 2013 UPDATE

I recently wrote code for reading and processing (searching for text in) 1 GB-ish text files (much larger than the files involved here) and achieved a significant performance gain by using a producer/consumer pattern. The producer task read in lines of text using the BufferedStream and handed them off to a separate consumer task that did the searching.

I used this as an opportunity to learn TPL Dataflow, which is very well suited for quickly coding this pattern.
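A rough outline of that producer/consumer arrangement with TPL Dataflow (the names and options here are illustrative, not the code from the original project; it needs the System.Threading.Tasks.Dataflow package):

using System;
using System.IO;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;   // NuGet: System.Threading.Tasks.Dataflow

static async Task SearchFileAsync(string path, string needle)
{
    // Consumer: searches each line, with a bounded queue so the producer
    // cannot run arbitrarily far ahead of the search.
    var consumer = new ActionBlock<string>(
        line => { if (line.Contains(needle)) Console.WriteLine(line); },
        new ExecutionDataflowBlockOptions { BoundedCapacity = 10000 });

    // Producer: read lines through a BufferedStream and hand them off.
    using (var fs = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    using (var bs = new BufferedStream(fs))
    using (var sr = new StreamReader(bs))
    {
        string line;
        while ((line = sr.ReadLine()) != null)
            await consumer.SendAsync(line);   // waits when the consumer's queue is full
    }

    consumer.Complete();
    await consumer.Completion;
}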

Why BufferedStream is faster

A buffer is a block of bytes in memory used to cache data, thereby reducing the number of calls to the operating system. Buffers improve read and write performance. A buffer can be used for either reading or writing, but never both simultaneously. The Read and Write methods of BufferedStream automatically maintain the buffer.

December 2014 UPDATE: Your Mileage May Vary

Based on the comments, FileStream should be using a BufferedStream internally. At the time this answer was first provided, I measured a significant performance boost by adding a BufferedStream. At the time I was targeting .NET 3.x on a 32-bit platform. Today, targeting .NET 4.5 on a 64-bit platform, I do not see any improvement.

Related

I came across a case where streaming a large, generated CSV file to the Response stream from an ASP.Net MVC action was very slow. Adding a BufferedStream improved performance by 100x in this instance. For more see Unbuffered Output Very Slow

If you read the performance and benchmark stats on this website, you'll see that the fastest way to read (because reading, writing, and processing are all different) a text file is the following snippet of code:

using (StreamReader sr = File.OpenText(fileName))
{
    string s = String.Empty;
    while ((s = sr.ReadLine()) != null)
    {
        // Do your stuff here.
    }
}

All up, about nine different methods were benchmarked, but that one seems to come out ahead the majority of the time, even out-performing the buffered reader that other answers have mentioned.

For binary files, the fastest way I have found to read them is this:

using (MemoryMappedFile mmf = MemoryMappedFile.CreateFromFile(file))
using (MemoryMappedViewStream mms = mmf.CreateViewStream())
using (BinaryReader b = new BinaryReader(mms))
{
    // Read from b here.
}

In my tests it's hundreds of times faster.

All excellent answers! However, for someone looking for an answer, they appear to be somewhat incomplete.

Since a standard String can only be of a limited size (2 GB to 4 GB depending on your configuration), these answers don't really fulfil the OP's question. One method is to work with a List of strings:

List<string> Words = new List<string>();

using (StreamReader sr = new StreamReader(@"C:\Temp\file.txt"))
{
    string line = string.Empty;

    while ((line = sr.ReadLine()) != null)
    {
        Words.Add(line);
    }
}

Some may want to tokenise and split each line when processing it. The list of strings can now hold very large volumes of text.

My file is over 13 GB.

The link below contains code that reads a piece of the file easily:

Read a large text file

More information
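In case that link goes stale, the basic idea of reading just a piece of a huge file is to seek to an offset and read a fixed-size chunk rather than the whole thing. A hedged sketch (the encoding handling is simplified and assumes the chunk does not split a multi-byte character):

static string ReadPiece(string path, long offset, int count)
{
    var buffer = new byte[count];
    using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
    {
        fs.Seek(offset, SeekOrigin.Begin);       // jump straight to the piece we want
        int read = fs.Read(buffer, 0, count);    // may read fewer bytes near the end of the file
        return Encoding.UTF8.GetString(buffer, 0, read);
    }
}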

Whilst the most upvoted answer is correct, it lacks multi-core processing. In my case, having 12 cores, I use Parallel.ForEach:

Parallel.ForEach(
    File.ReadLines(filename), // returns IEnumerable<string>: lazy loading
    new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
    (line, state, index) =>
    {
        // Process the line value here.
    });

Worth mentioning, I got this as an interview question asking to return the 10 most frequent occurrences:

var result = new ConcurrentDictionary<string, int>(StringComparer.InvariantCultureIgnoreCase);

Parallel.ForEach(
    File.ReadLines(filename),
    new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
    (line, state, index) =>
    {
        result.AddOrUpdate(line, 1, (key, val) => val + 1);
    });

return result
    .OrderByDescending(x => x.Value)
    .Take(10)
    .Select(x => x.Value);

Benchmarking:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-8700K CPU 3.70GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
  [Host]     : .NET Framework 4.8 (4.8.4250.0), X64 RyuJIT
  DefaultJob : .NET Framework 4.8 (4.8.4250.0), X64 RyuJIT

Method                  Mean      Error     StdDev    Gen 0      Gen 1     Gen 2   Allocated
GetTopWordsSync         33.03 s   0.175 s   0.155 s   1194000    314000    7000    7.06 GB
GetTopWordsParallel     10.89 s   0.121 s   0.113 s   1225000    354000    8000    7.18 GB

And as you can see, that's roughly a 3x speed-up (about a 67% reduction in run time).

But please note that roughly 7 GB gets allocated in memory, and since it's one big blob it puts a lot of pressure on the GC.

It's been more than 10 years since the last answers. This is my solution for reading text files of more than 10 GB and returning the requested number of lines. Putting it here in case anyone is seeking help :)

public static List<string> ReadFileNGetLine(string filepath, int lineCount)
{
    List<string> listString = new List<string>();
    try
    {
        StringBuilder resultAsString = new StringBuilder();

        FileInfo info = new FileInfo(filepath);
        if (info.Length < 10)
        {
            return listString;
        }

        using (MemoryMappedFile memoryMappedFile = MemoryMappedFile.CreateFromFile(filepath))
        using (MemoryMappedViewStream memoryMappedViewStream = memoryMappedFile.CreateViewStream(0, info.Length))
        {
            for (long i = 0; i < info.Length; i++)
            {
                // Stop once the requested number of lines has been collected.
                if (listString.Count >= lineCount)
                {
                    break;
                }

                // Reads a byte from the stream and advances the position by one byte,
                // or returns -1 at the end of the stream.
                int result = memoryMappedViewStream.ReadByte();
                if (result == -1)
                {
                    break;
                }

                char letter = (char)result;

                // At a line break, collect the full line ('\r' is skipped so that
                // both Unix "\n" and DOS "\r\n" endings work); otherwise keep
                // accumulating characters.
                if (letter == '\r' || letter == '\n')
                {
                    if (letter == '\n')
                    {
                        listString.Add(resultAsString.ToString());
                        resultAsString.Clear();
                    }
                }
                else
                {
                    resultAsString.Append(letter);
                }
            }
        }
    }
    catch (Exception)
    {
        throw;
    }
    return listString;
}