如何在 Python 中压缩字符串?

如何在 Python 中压缩字符串?

存在 gzip.GzipFile ,但这是针对文件对象的——那么使用普通字符串呢?

94628 次浏览

The easiest way is the zlib encoding:

compressed_value = s.encode("zlib")

Then you decompress it with:

plain_string_again = compressed_value.decode("zlib")

If you want to produce a complete gzip-compatible binary string, with the header etc, you could use gzip.GzipFile together with StringIO:

try:
from StringIO import StringIO  # Python 2.7
except ImportError:
from io import StringIO  # Python 3.x
import gzip
out = StringIO()
with gzip.GzipFile(fileobj=out, mode="w") as f:
f.write("This is mike number one, isn't this a lot of fun?")
out.getvalue()


# returns '\x1f\x8b\x08\x00\xbd\xbe\xe8N\x02\xff\x0b\xc9\xc8,V\x00\xa2\xdc\xcc\xecT\x85\xbc\xd2\xdc\xa4\xd4"\x85\xfc\xbcT\x1d\xa0X\x9ez\x89B\tH:Q!\'\xbfD!?M!\xad4\xcf\x1e\x00w\xd4\xea\xf41\x00\x00\x00'
s = "a long string of characters"


g = gzip.open('gzipfilename.gz', 'w', 5) # ('filename', 'read/write mode', compression level)
g.write(s)
g.close()

For those who want to compress a Pandas dataframe in JSON format:

Tested with Python 3.6 and Pandas 0.23

import sys
import zlib, lzma, bz2
import math


def convert_size(size_bytes):
if size_bytes == 0:
return "0B"
size_name = ("B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB")
i = int(math.floor(math.log(size_bytes, 1024)))
p = math.pow(1024, i)
s = round(size_bytes / p, 2)
return "%s %s" % (s, size_name[i])


dataframe = pd.read_csv('...') # your CSV file
dataframe_json = dataframe.to_json(orient='split')
data = dataframe_json.encode()
compressed_data = bz2.compress(data)
decompressed_data = bz2.decompress(compressed_data).decode()
dataframe_aux = pd.read_json(decompressed_data, orient='split')


#Original data size:  10982455 10.47 MB
#Encoded data size:  10982439 10.47 MB
#Compressed data size:  1276457 1.22 MB (lzma, slow), 2087131 1.99 MB (zlib, fast), 1410908 1.35 MB (bz2, fast)
#Decompressed data size:  10982455 10.47 MB
print('Original data size: ', sys.getsizeof(dataframe_json), convert_size(sys.getsizeof(dataframe_json)))
print('Encoded data size: ', sys.getsizeof(data), convert_size(sys.getsizeof(data)))
print('Compressed data size: ', sys.getsizeof(compressed_data), convert_size(sys.getsizeof(compressed_data)))
print('Decompressed data size: ', sys.getsizeof(decompressed_data), convert_size(sys.getsizeof(decompressed_data)))


print(dataframe.head())
print(dataframe_aux.head())

Python3 version of Sven Marnach's 2011 answer:

import gzip
exampleString = 'abcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijmortenpunnerudengelstadrocksklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuv123'
compressed_value = gzip.compress(bytes(exampleString, 'utf-8'))
plain_string_again = gzip.decompress(compressed_value).decode('utf-8')

Martin Thoma's answer almost worked: I had to use BytesIO as mentioned in this answer.

from io import BytesIO # Python 3.x, haven't tested 2.7
import gzip
out = BytesIO()
with gzip.GzipFile(fileobj=out, mode="w") as f:
f.write("This is mike number one, isn't this a lot of fun?")
out.getvalue()

The original code produced a TypeError: string argument expected, got 'bytes'