How to read a binary file into a vector of unsigned chars

Recently I was asked to write a function that reads a binary file into a std::vector<BYTE>, where BYTE is an unsigned char. Quite quickly I came up with something like this:

#include <fstream>
#include <vector>
typedef unsigned char BYTE;

std::vector<BYTE> readFile(const char* filename)
{
    // open the file:
    std::streampos fileSize;
    std::ifstream file(filename, std::ios::binary);

    // get its size:
    file.seekg(0, std::ios::end);
    fileSize = file.tellg();
    file.seekg(0, std::ios::beg);

    // read the data:
    std::vector<BYTE> fileData(fileSize);
    file.read((char*) &fileData[0], fileSize);
    return fileData;
}

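This seems unnecessarily complicated, and the explicit cast to char* that I was forced to use when calling file.read doesn't make me feel any better about it.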


Another option is to use std::istreambuf_iterator:

std::vector<BYTE> readFile(const char* filename)
{
    // open the file:
    std::ifstream file(filename, std::ios::binary);

    // read the data:
    return std::vector<BYTE>((std::istreambuf_iterator<char>(file)),
                             std::istreambuf_iterator<char>());
}

This is pretty simple and short, but I still have to use std::istreambuf_iterator<char> even when I'm reading into a std::vector<unsigned char>.


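The last option, which looks perfectly straightforward, is to use std::basic_ifstream<BYTE>, which states explicitly: "I want an input file stream and I want to use it to read BYTEs":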

std::vector<BYTE> readFile(const char* filename)
{
    // open the file:
    std::basic_ifstream<BYTE> file(filename, std::ios::binary);

    // read the data:
    return std::vector<BYTE>((std::istreambuf_iterator<BYTE>(file)),
                             std::istreambuf_iterator<BYTE>());
}

But I'm not sure whether basic_ifstream is an appropriate choice in this case.

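What is the best way to read a binary file into a vector? I'd also like to know what is happening "behind the scenes" and what problems I might run into (apart from the stream not being opened properly, which can be avoided with a simple is_open check).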

Is there any good reason to prefer std::istreambuf_iterator here?
(The only advantage I can see is simplicity.)


Since you are loading the entire file into memory, the most efficient approach is to map the file into memory. The kernel loads the file into its page cache anyway, and mapping the file simply exposes those cached pages to your process. This is also known as zero-copy.

When you use std::vector<>, the data is copied from the kernel page cache into the std::vector<>, which is unnecessary when you just want to read the file.

Also, when you pass two input iterators to std::vector<>, it grows its buffer while reading because it does not know the file size. When you resize std::vector<> to the file size first, it needlessly zeroes out its contents, which are about to be overwritten with file data anyway. Both methods are sub-optimal in terms of space and time.
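To make the mapping idea concrete, here is a minimal sketch that assumes a POSIX system (open, fstat, mmap). The MappedFile class and its member names are hypothetical, not part of any library, and error handling is kept to simple exceptions.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

#include <fcntl.h>     // open
#include <sys/mman.h>  // mmap, munmap
#include <sys/stat.h>  // fstat
#include <unistd.h>    // close

// Hypothetical RAII wrapper: maps a whole file read-only (POSIX only).
class MappedFile {
public:
    explicit MappedFile(const char* filename)
    {
        const int fd = ::open(filename, O_RDONLY);
        if (fd == -1)
            throw std::runtime_error(std::string("cannot open ") + filename);

        struct stat st;
        if (::fstat(fd, &st) == -1) {
            ::close(fd);
            throw std::runtime_error("fstat failed");
        }
        size_ = static_cast<std::size_t>(st.st_size);

        if (size_ != 0) {  // mmap rejects a zero-length mapping
            void* p = ::mmap(nullptr, size_, PROT_READ, MAP_PRIVATE, fd, 0);
            if (p == MAP_FAILED) {
                ::close(fd);
                throw std::runtime_error("mmap failed");
            }
            data_ = static_cast<const unsigned char*>(p);
        }
        ::close(fd);  // the mapping stays valid after the descriptor is closed
    }

    ~MappedFile()
    {
        if (data_ != nullptr)
            ::munmap(const_cast<unsigned char*>(data_), size_);
    }

    // Copying would unmap the same region twice, so it is disabled.
    MappedFile(const MappedFile&) = delete;
    MappedFile& operator=(const MappedFile&) = delete;

    const unsigned char* data() const { return data_; }
    std::size_t size() const { return size_; }

private:
    const unsigned char* data_ = nullptr;
    std::size_t size_ = 0;
};
```

The bytes are then available through data() and size() for as long as the MappedFile object lives. Nothing is copied into a std::vector, so this only fits when read-only access to the file contents is enough.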

I would have thought that the first method, using the size and istream::read(), would be the most efficient. The "cost" of casting to char * is most likely zero: casts of this kind simply tell the compiler "Hey, I know you think this is a different type, but I really want this type here...", and they do not add any extra instructions. If you wish to confirm this, try reading the file into a char array and compare the generated assembly. Aside from a little extra work to figure out the address of the buffer inside the vector, there shouldn't be any difference.
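As an illustration, a variant of the first method with the cast spelled as reinterpret_cast and basic error checks added might look like the sketch below. The name readFileChecked and the choice to throw std::runtime_error are assumptions made for this example, not something from the question.

```cpp
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

typedef unsigned char BYTE;

// Hypothetical variant of the size-then-read method with error checks.
std::vector<BYTE> readFileChecked(const char* filename)
{
    // std::ios::ate positions the stream at the end right after opening
    std::ifstream file(filename, std::ios::binary | std::ios::ate);
    if (!file.is_open())
        throw std::runtime_error(std::string("cannot open ") + filename);

    // tellg() reports -1 on failure
    const std::streamoff fileSize = file.tellg();
    if (fileSize < 0)
        throw std::runtime_error("cannot determine file size");
    file.seekg(0, std::ios::beg);

    std::vector<BYTE> fileData(static_cast<std::size_t>(fileSize));
    if (!fileData.empty()) {
        // the cast only reinterprets the pointer type; no bytes are converted
        file.read(reinterpret_cast<char*>(fileData.data()), fileSize);
        if (!file)
            throw std::runtime_error("short read");
    }
    return fileData;
}
```

Whether exceptions or an error code fit better depends on the calling code; the point here is only that the read itself stays a single call into the vector's buffer.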

As always, the only way to tell for sure IN YOUR CASE what is the most efficient is to measure it. "Asking on the internet" is not proof.
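A rough way to take such a measurement is sketched below with std::chrono::steady_clock. The helper bestMicroseconds and the reader readWithIterators are hypothetical names for this example. Keep in mind that after the first run the file is probably served from the page cache, so the numbers mostly reflect copying cost rather than disk speed.

```cpp
#include <chrono>
#include <fstream>
#include <iterator>
#include <vector>

typedef unsigned char BYTE;

// Reader under test: the istreambuf_iterator approach.
std::vector<BYTE> readWithIterators(const char* filename)
{
    std::ifstream file(filename, std::ios::binary);
    return std::vector<BYTE>((std::istreambuf_iterator<char>(file)),
                             std::istreambuf_iterator<char>());
}

// Hypothetical helper: calls `reader` `runs` times and returns the best
// wall-clock time in microseconds (or -1 if `runs` is not positive).
template <typename Reader>
long long bestMicroseconds(Reader reader, const char* filename, int runs)
{
    long long best = -1;
    for (int i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        const std::vector<BYTE> data = reader(filename);
        const auto stop = std::chrono::steady_clock::now();
        (void)data;  // the contents are not needed, only the elapsed time

        const long long us =
            std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
        if (best < 0 || us < best)
            best = us;
    }
    return best;
}
```

Calling bestMicroseconds once per candidate reader, on the same file and with the same number of runs, gives comparable figures for your own machine and file sizes.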

When testing for performance, I would include a test case for:

#include <iterator>  // std::istream_iterator

std::vector<BYTE> readFile(const char* filename)
{
    // open the file:
    std::ifstream file(filename, std::ios::binary);

    // Stop eating new lines in binary mode!!!
    file.unsetf(std::ios::skipws);

    // get its size:
    std::streampos fileSize;

    file.seekg(0, std::ios::end);
    fileSize = file.tellg();
    file.seekg(0, std::ios::beg);

    // reserve capacity
    std::vector<BYTE> vec;
    vec.reserve(fileSize);

    // read the data:
    vec.insert(vec.begin(),
               std::istream_iterator<BYTE>(file),
               std::istream_iterator<BYTE>());

    return vec;
}

My thinking is that the constructor of Method 1 touches the elements in the vector, and then the read touches each element again.

Method 2 and Method 3 look most promising, but could suffer one or more reallocations as the vector grows. Hence the reason to reserve before reading or inserting.
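The effect of reserving can be observed without any file at all. The sketch below uses a hypothetical helper, countReallocations, that appends bytes one at a time and counts how often the vector's capacity changes; each change corresponds to a reallocation plus a copy of the existing elements.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: appends `count` bytes one by one and returns how many
// times the capacity changed along the way.
std::size_t countReallocations(std::size_t count, bool reserveFirst)
{
    std::vector<unsigned char> vec;
    if (reserveFirst)
        vec.reserve(count);  // one allocation up front

    std::size_t reallocations = 0;
    std::size_t capacity = vec.capacity();
    for (std::size_t i = 0; i < count; ++i) {
        vec.push_back(static_cast<unsigned char>(i));
        if (vec.capacity() != capacity) {
            ++reallocations;
            capacity = vec.capacity();
        }
    }
    return reallocations;
}
```

With reserveFirst set, the count is zero because the capacity already covers every element. Without it, the exact number depends on the standard library's growth policy.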

I would also test with std::copy:

...
std::vector<BYTE> vec;
vec.reserve(fileSize);

std::copy(std::istream_iterator<BYTE>(file),
          std::istream_iterator<BYTE>(),
          std::back_inserter(vec));

In the end, I think the best solution will avoid operator >> from istream_iterator (and all the overhead and goodness from operator >> trying to interpret binary data). But I don't know what to use that allows you to directly copy the data into the vector.
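One candidate that avoids operator >> is std::istreambuf_iterator<char>, which pulls raw characters from the stream buffer without any formatting or whitespace skipping. The sketch below combines it with reserve. The function name readFileUnformatted and the choice to return an empty vector when the file cannot be opened are assumptions for this example.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <vector>

typedef unsigned char BYTE;

// Hypothetical reader: reserve once, then copy raw characters from the
// stream buffer. Returns an empty vector if the file cannot be opened.
std::vector<BYTE> readFileUnformatted(const char* filename)
{
    std::ifstream file(filename, std::ios::binary | std::ios::ate);
    std::vector<BYTE> vec;
    if (!file)
        return vec;

    // opened at the end (std::ios::ate), so tellg() gives the size
    const std::streamoff fileSize = file.tellg();
    file.seekg(0, std::ios::beg);
    if (fileSize > 0)
        vec.reserve(static_cast<std::size_t>(fileSize));

    // istreambuf_iterator reads unformatted characters: no operator >>,
    // so whitespace bytes are kept
    vec.assign(std::istreambuf_iterator<char>(file),
               std::istreambuf_iterator<char>());
    return vec;
}
```

This still copies one character at a time through the iterators, so it is worth timing against a single istream::read() call before settling on it.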

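Finally, my testing with binary data shows that ios::binary alone does not stop operator >> from skipping whitespace bytes; the flag only disables newline translation. Hence the unsetf(std::ios::skipws) call above, which is equivalent to the std::noskipws manipulator (declared in <ios>).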
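The whitespace behaviour is easy to reproduce without a file. In the sketch below a std::istringstream stands in for the file stream (an assumption made to keep the example self-contained), and countExtracted is a hypothetical helper that counts how many bytes std::istream_iterator actually delivers.

```cpp
#include <cstddef>
#include <ios>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: counts the bytes that istream_iterator extracts from
// `bytes`, with whitespace skipping left on or switched off.
std::size_t countExtracted(const std::string& bytes, bool skipWhitespace)
{
    std::istringstream in(bytes);
    if (!skipWhitespace)
        in >> std::noskipws;  // same effect as in.unsetf(std::ios::skipws)

    const std::vector<unsigned char> data(
        (std::istream_iterator<unsigned char>(in)),
        std::istream_iterator<unsigned char>());
    return data.size();
}
```

For the five-byte input "a b\nc", the default setting yields 3 bytes because the space and the newline are dropped, while the noskipws variant yields all 5.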

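#include <cstdint>
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

int main()
{
    std::ifstream stream("mona-lisa.raw", std::ios::in | std::ios::binary);
    std::vector<uint8_t> contents((std::istreambuf_iterator<char>(stream)), std::istreambuf_iterator<char>());

    // print each byte as a number rather than as a character
    for (auto i : contents) {
        int value = i;
        std::cout << "data: " << value << std::endl;
    }

    std::cout << "file size: " << contents.size() << std::endl;
}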

The class below extends vector with a binary file load and save. I returned to this question multiple times already, so this is the code for my next return - and for all others who will be looking for the binary file save method next. :)

#include <cinttypes>
#include <fstream>
#include <iterator>
#include <vector>

class FileVector : public std::vector<uint8_t>
{
public:
    using std::vector<uint8_t>::vector;

    void loadFromFile(const char *filename)
    {
        std::ifstream file(filename, std::ios::in | std::ios::binary);
        // istream_iterator uses operator >>, which drops whitespace bytes
        // unless skipws is cleared
        file.unsetf(std::ios::skipws);
        insert(begin(),
               std::istream_iterator<uint8_t>(file),
               std::istream_iterator<uint8_t>());
    }

    void saveTofile(const char *filename) const
    {
        std::ofstream file(filename, std::ios::out | std::ios::binary);
        file.write((const char *) data(), size());
        file.close();
    }
};

NOTE: For load optimization please consider determining file size and pre-allocating required space as mentioned in other comments here.