Difference between `open` and `io.BytesIO` in binary streams

I'm learning about working with streams in Python and I noticed that the IO docs say the following:

The easiest way to create a binary stream is with open() with 'b' in the mode string:

f = open("myfile.jpg", "rb")

In-memory binary streams are also available as BytesIO objects:

f = io.BytesIO(b"some initial binary data: \x00\x01")

What is the difference between f as defined by open and f as defined by BytesIO? In other words, what makes an "in-memory binary stream", and how is that different from what open does?


Using open opens a file on your hard drive. Depending on the mode you use, you can read from or write to the disk (or both).

A BytesIO object isn't associated with any real file on the disk. It's just a chunk of memory that behaves like a file does. It has the same API as a file object returned from open (with mode r+b, allowing reading and writing of binary data).
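To make the "same API" point concrete, here is a minimal sketch that treats a BytesIO exactly like a file opened in r+b mode. The initial bytes are the ones from the question; the variable names are just for illustration.

```python
import io

# An in-memory stream preloaded with data; nothing touches the disk.
f = io.BytesIO(b"some initial binary data: \x00\x01")

first = f.read(4)   # read like a file: the first four bytes
f.seek(0)           # rewind like a file
f.write(b"SOME")    # overwrite the first four bytes in place
f.seek(0)
data = f.read()     # read everything back

print(first)
print(data)
```

The read, seek, and write calls are the same ones you would use on the object returned by open("myfile.jpg", "r+b").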

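BytesIO (and its close sibling StringIO, which is always in text mode) is useful when you need to pass data to or from an API that expects to be given a file object, but you'd prefer to pass the data directly. You can load the input data you already have into a BytesIO before handing it to the library. After the library returns, you can use the getvalue() method to get any data it wrote to the file from the BytesIO. (Usually you only need to do one of those, of course.)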
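As a rough illustration of both directions, here is a sketch using the standard library's zipfile module, which accepts a file object. The member name "hello.txt" is made up for the example.

```python
import io
import zipfile

# Output direction: the library writes into an in-memory buffer
# instead of a file on disk.
out_buf = io.BytesIO()
with zipfile.ZipFile(out_buf, "w") as zf:
    zf.writestr("hello.txt", "Hello World")  # hypothetical member name
zip_bytes = out_buf.getvalue()  # everything the library wrote

# Input direction: wrap bytes we already have so the library
# can read them as if they came from a file.
in_buf = io.BytesIO(zip_bytes)
with zipfile.ZipFile(in_buf) as zf:
    names = zf.namelist()
    content = zf.read("hello.txt")

print(names, content)
```

No temporary file is needed at any point; the archive lives entirely in memory.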

For simplicity's sake, let's consider writing instead of reading for now.

So when you use open(), like say:

with open("test.dat", "wb") as f:
    f.write(b"Hello World")
    f.write(b"Hello World")
    f.write(b"Hello World")

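After executing that, a file called test.dat will be created, containing Hello World three times. The data won't be kept in memory after it's written to the file (unless you keep a reference to it under some name).

Now consider io.BytesIO() instead: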

with io.BytesIO() as f:
    f.write(b"Hello World")
    f.write(b"Hello World")
    f.write(b"Hello World")

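Instead of writing the contents to a file, it's written to an in-memory buffer. In other words, a chunk of RAM. Essentially, writing this is equivalent to: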

buffer = b""
buffer += b"Hello World"
buffer += b"Hello World"
buffer += b"Hello World"

To mirror the example with the with statement, there would also be a del buffer at the end.
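One practical consequence: a BytesIO is closed when its with block exits and its buffer is discarded, so you have to grab the contents while it's still open. A small sketch (the variable names are only for illustration):

```python
import io

with io.BytesIO() as f:
    f.write(b"Hello World")
    contents = f.getvalue()  # copy the data out before the block ends

closed_after = f.closed  # the with block closed the stream

# Reading from a closed BytesIO raises ValueError.
try:
    f.getvalue()
    raised = False
except ValueError:
    raised = True

print(contents, closed_after, raised)
```

If you need the data after the block, keep the bytes returned by getvalue(), or create the BytesIO without a with statement.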

The key difference here is optimization and performance. io.BytesIO is able to do some optimizations that make it faster than simply concatenating all the b"Hello World" one by one.

Just to prove it, here's a small benchmark:

  • Concat: 1.3529 seconds
  • BytesIO: 0.0090 seconds

import io
import time


# Time 50,000 repeated bytes concatenations.
begin = time.time()
buffer = b""
for i in range(0, 50000):
    buffer += b"Hello World"
end = time.time()
seconds = end - begin
print("Concat:", seconds)


# Time the same number of writes to an in-memory stream.
begin = time.time()
buffer = io.BytesIO()
for i in range(0, 50000):
    buffer.write(b"Hello World")
end = time.time()
seconds = end - begin
print("BytesIO:", seconds)

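Besides the performance gain, using BytesIO instead of concatenating has the advantage that a BytesIO can be used in place of a file object. Say you have a function that expects a file object to write to. Then you can give it an in-memory buffer instead of a file.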
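Here's a minimal sketch of that idea. write_report is a hypothetical function invented for this example; the point is only that it calls write() on whatever it is given, so it works with a real file and with a BytesIO alike.

```python
import io
import os
import tempfile


def write_report(fileobj, rows):
    """Hypothetical helper: writes each row to any binary file-like object."""
    for row in rows:
        fileobj.write(row + b"\n")


rows = [b"first", b"second"]

# Give it an in-memory buffer...
buf = io.BytesIO()
write_report(buf, rows)
in_memory = buf.getvalue()

# ...or a real file on disk (a temporary one here); the function doesn't care.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "report.dat")
    with open(path, "wb") as f:
        write_report(f, rows)
    with open(path, "rb") as f:
        on_disk = f.read()

print(in_memory == on_disk)
```

This is also handy in tests, where you can check what a function wrote without creating any files.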

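The difference is that open("myfile.jpg", "rb") gives you a file object that reads the contents of myfile.jpg from disk, whereas BytesIO, again, is just a buffer containing some data.

Since BytesIO is just a buffer, if you want to write the contents to a file later, you have to do: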

buffer = io.BytesIO()
# ...
with open("test.dat", "wb") as f:
    f.write(buffer.getvalue())
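As a possible variation for large buffers: getvalue() returns a separate bytes object, which can involve an extra copy, while getbuffer() exposes the same data as a memoryview. The sketch below writes to a temporary path chosen for illustration rather than to test.dat.

```python
import io
import os
import tempfile

buffer = io.BytesIO()
buffer.write(b"Hello World" * 3)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.dat")  # temporary path for illustration
    with open(path, "wb") as f:
        # getbuffer() gives a view of the data without copying it;
        # the with block releases the view as soon as we're done.
        with buffer.getbuffer() as view:
            f.write(view)
    with open(path, "rb") as f:
        written = f.read()

print(written)
```

For small buffers the getvalue() version is simpler and the difference is unlikely to matter.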

Also, you didn't mention a version; I'm using Python 3. Regarding the examples: I'm using the with statement instead of calling f.close().