Python: reading lines from a compressed text file

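Is it possible to read a line from a gzip-compressed text file using Python without fully decompressing the file? I have a text.gz file of about 200 MB. When I extract it, it becomes 7.4 GB. And it is not the only file I have to read: for the whole process, I have to read 10 files. Although this will be a sequential job, I think it would be smart to do it without extracting everything. How can this be done with Python? I need to read the text file line by line.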


Have you tried using gzip.GzipFile? Arguments are similar to open.
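A minimal sketch of that suggestion follows. The helper name read_lines_with_gzipfile and the UTF-8 default are my assumptions, not something from the question. gzip.GzipFile always works in binary mode, so it is wrapped in io.TextIOWrapper here to get str lines; data is decompressed only as the caller iterates.

```python
import gzip
import io


def read_lines_with_gzipfile(path, encoding="utf-8"):
    """Yield text lines from a gzip file, decompressing as the caller iterates."""
    # GzipFile yields bytes; TextIOWrapper decodes them to str on the fly.
    with gzip.GzipFile(path, mode="rb") as raw:
        with io.TextIOWrapper(raw, encoding=encoding) as text:
            for line in text:
                yield line
```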

You could use the standard gzip module in Python. Just use:

gzip.open('myfile.gz')

to open the file like any other file and read its lines.

More information here: Python gzip module

Using gzip.open (a wrapper around gzip.GzipFile):

import gzip


with gzip.open('input.gz', 'rt') as f:
    for line in f:
        print('got line', line)

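Note: gzip.open(filename, mode) is a convenience wrapper around gzip.GzipFile(filename, mode); in text mode ('rt') it also wraps the result in an io.TextIOWrapper, so iterating yields str lines rather than bytes. I prefer gzip.open, as it looks similar to the with open(...) as f: used for opening uncompressed files.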
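Since the question mentions about ten large archives, here is a hedged sketch of how the same idea could be stretched over several files. The helper names iter_gzip_lines and head are hypothetical, and UTF-8 is an assumption about the data. Both rely on the file object being a lazy iterator, so nothing is extracted to disk and only a small buffer is held in memory.

```python
import gzip
from itertools import islice


def iter_gzip_lines(paths, encoding="utf-8"):
    """Yield lines from several gzip files, one file after another."""
    for path in paths:
        with gzip.open(path, "rt", encoding=encoding) as f:
            yield from f


def head(path, n, encoding="utf-8"):
    """Return the first n lines of one gzip file without reading the rest."""
    with gzip.open(path, "rt", encoding=encoding) as f:
        return list(islice(f, n))
```

Keep in mind that gzip is a stream format: getting to line N still means decompressing everything before it, it just happens in memory and incrementally.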

The gzip library (obviously) uses gzip, which can be a bit slow. You can speed things up with an external call to pigz, the parallelized version of gzip. The downsides are that you have to install pigz and that it will use more cores during the run, but it is much faster and not more memory-intensive. The call to open the file then becomes os.popen('pigz -dc ' + filename) instead of gzip.open(filename, 'rt'). The pigz flags are -d for decompress and -c for writing to stdout, which os.popen can then capture.
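If you would rather not build a shell command by string concatenation (file names with spaces or quotes will break it), the same pigz call can be sketched with subprocess and an argument list. This is only an illustration: the helper name iter_lines, the UTF-8 default, and the fallback to the gzip module when pigz is not on PATH are my additions, not part of the benchmark below.

```python
import gzip
import shutil
import subprocess


def iter_lines(path, encoding="utf-8"):
    """Yield text lines from a gzip file, using pigz when it is installed."""
    pigz = shutil.which("pigz")
    if pigz is None:
        # Fallback: decompress in-process with the standard library.
        with gzip.open(path, "rt", encoding=encoding) as f:
            yield from f
        return
    # An argument list bypasses the shell, so odd file names are passed safely.
    with subprocess.Popen(
        [pigz, "-dc", path], stdout=subprocess.PIPE, encoding=encoding
    ) as proc:
        yield from proc.stdout
    # Leaving the with block waits for pigz, so returncode is set here.
    if proc.returncode != 0:
        raise RuntimeError("pigz exited with status %d" % proc.returncode)
```

Usage is the same as with a file object: for line in iter_lines('input.gz'): ...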

The following code takes a file and a number (1 or 2) and counts the number of lines in the file using the corresponding call, while measuring how long that takes. Save the following code as unzip-file.py:

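#!/usr/bin/python
import os
import sys
import time
import gzip


def local_unzip(obj):
    # Count the lines of an already-opened file object and time the loop.
    t0 = time.time()
    count = 0
    with obj as f:
        for line in f:
            count += 1
    print(time.time() - t0, count)


r = sys.argv[1]
if sys.argv[2] == "1":
    # Option 1: decompress in-process with the gzip module.
    local_unzip(gzip.open(r, 'rt'))
else:
    # Option 2: let pigz decompress and read its stdout.
    local_unzip(os.popen('pigz -dc ' + r))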

Running both variants on a 28G file under /usr/bin/time -f %M, which reports the maximum resident set size of the process (in kilobytes), we get:

$ /usr/bin/time -f %M ./unzip-file.py $file 1
(3037.2604110240936, 1223422024)
5116


$ /usr/bin/time -f %M ./unzip-file.py $file 2
(598.771901845932, 1223422024)
4996

This shows that the pigz call is about five times faster (10 minutes compared to 50 minutes) while using basically the same maximum memory. It is also worth noting that, depending on what you do with each line, reading the file might not be the limiting factor, in which case the option you pick does not matter.