Hashing a file in Python

小开

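TL;DR: use a buffer so you don't use tons of memory.

I believe we get to the crux of your problem when we consider the memory implications of working with very large files. We don't want this bad boy to churn through 2 GB of RAM for a 2 GB file, so, as pasztorpisti points out, we have to process those bigger files in chunks!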

import sys
import hashlib


# BUF_SIZE is totally arbitrary, change for your app!
BUF_SIZE = 65536  # lets read stuff in 64kb chunks!


md5 = hashlib.md5()
sha1 = hashlib.sha1()


with open(sys.argv[1], 'rb') as f:
    while True:
        data = f.read(BUF_SIZE)
        if not data:
            break
        md5.update(data)
        sha1.update(data)


print("MD5: {0}".format(md5.hexdigest()))
print("SHA1: {0}".format(sha1.hexdigest()))

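What we've done is update our hashes of this bad boy in 64 KB chunks as we go along, using hashlib's handy dandy update method. This way we use a lot less memory than the 2 GB it would take to hash the whole thing at once!

You can test this with: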
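The same chunked loop can be wrapped in a reusable function. The following is only a minimal sketch, not part of the script above: the name `hash_file_chunked` and its parameters are hypothetical, and it assumes every requested algorithm name is accepted by `hashlib.new()`.

```python
import hashlib


def hash_file_chunked(path, algorithms=("md5", "sha1"), buf_size=65536):
    """Return {algorithm: hex digest} for the file at `path`.

    Hypothetical helper: reads the file once in `buf_size`-byte chunks and
    feeds every chunk to each hash object, so memory use stays close to
    `buf_size` regardless of the file size.
    """
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as f:
        while True:
            data = f.read(buf_size)
            if not data:  # an empty bytes object means end of file
                break
            for h in hashers.values():
                h.update(data)
    return {name: h.hexdigest() for name, h in hashers.items()}
```

Calling `hash_file_chunked("bigfile")` should give the same two digests as the script, just returned as a dictionary instead of printed.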

$ mkfile 2g bigfile
$ python hashes.py bigfile
MD5: a981130cf2b7e09f4686dc273cf7187e
SHA1: 91d50642dd930e9542c39d36f0516d45f4e1af0d
$ md5 bigfile
MD5 (bigfile) = a981130cf2b7e09f4686dc273cf7187e
$ shasum bigfile
91d50642dd930e9542c39d36f0516d45f4e1af0d  bigfile
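The `mkfile` command used above ships with macOS and may not exist on other systems. As a rough stand-in, a file of zero bytes can be created from Python. This is a sketch under that assumption; `make_zero_file` and its parameters are hypothetical names, not part of the original answer.

```python
import os


def make_zero_file(path, size, chunk_size=1024 * 1024):
    """Write `size` zero bytes to `path`, at most `chunk_size` bytes at a time.

    Hypothetical helper for building a large test file without holding
    the whole content in memory. Returns the resulting file size.
    """
    zeros = bytes(chunk_size)  # a reusable block of zero bytes
    remaining = size
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(chunk_size, remaining)
            f.write(zeros[:n])
            remaining -= n
    return os.path.getsize(path)
```

For example, `make_zero_file("bigfile", 2 * 1024**3)` writes 2 GiB of zeros, after which `python hashes.py bigfile` can be run as shown above.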

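Hope that helps!

All of this is also outlined in the linked question on the right-hand side: Get MD5 hash of big files in Python

Addendum!

In general, when writing Python it helps to get into the habit of following PEP 8. For example, in Python variable names are typically underscore_separated, not camelCased. But that's just style, and no one really cares about those things except the people who have to read bad style... which might be you, reading this code years from now.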

小开

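For the correct and efficient computation of the hash value of a file (in Python 3):

Open the file in binary mode (i.e. add 'b' to the file mode) to avoid character encoding and line-ending conversion issues.
Don't read the complete file into memory, since that is a waste of memory. Instead, read it sequentially block by block and update the hash for each block.
Eliminate double buffering, i.e. don't use buffered IO, because we already use an optimal block size.
Use readinto() to avoid buffer churning.

Example: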

import hashlib


def sha256sum(filename):
    h  = hashlib.sha256()
    b  = bytearray(128*1024)
    mv = memoryview(b)
    with open(filename, 'rb', buffering=0) as f:
        while n := f.readinto(mv):
            h.update(mv[:n])
    return h.hexdigest()

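Note that the while loop uses an assignment expression, which isn't available in Python versions older than 3.8.

With older Python 3 versions you can use an equivalent variation: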

import hashlib


def sha256sum(filename):
    h  = hashlib.sha256()
    b  = bytearray(128*1024)
    mv = memoryview(b)
    with open(filename, 'rb', buffering=0) as f:
        for n in iter(lambda : f.readinto(mv), 0):
            h.update(mv[:n])
    return h.hexdigest()
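The older-Python variant relies on the two-argument form of the built-in `iter(callable, sentinel)`, which keeps calling the callable until it returns the sentinel value. Here is a small sketch of that behaviour, using an in-memory `io.BytesIO` stream and a deliberately tiny buffer instead of a real file (both are choices made only for illustration):

```python
import io

stream = io.BytesIO(b"abcdefghij")  # 10 bytes of in-memory data
buf = bytearray(4)                  # reusable 4-byte buffer
mv = memoryview(buf)

chunks = []
# readinto() returns the number of bytes read; 0 signals end of stream,
# which is the sentinel that stops the iteration.
for n in iter(lambda: stream.readinto(mv), 0):
    chunks.append(bytes(mv[:n]))  # copy only the n bytes that were filled

print(chunks)  # [b'abcd', b'efgh', b'ij']
```

The last chunk is shorter than the buffer, which is why the hashing code slices the memoryview with `mv[:n]` before calling `update()`.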

小开

I have written a module that can hash big files with different algorithms.

pip3 install py_essentials

Use the module like this:

from py_essentials import hashing as hs
hash = hs.fileChecksum("path/to/the/file.txt", "sha256")

小开

import hashlib
user = input("Enter ")
h = hashlib.md5(user.encode())
h2 = h.hexdigest()
with open("encrypted.txt", "w") as e:
    print(h2, file=e)

with open("encrypted.txt", "r") as e:
    p = e.readline().strip()
    print(p)

小开

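I would simply propose:

import hashlib


def get_digest(file_path):
    h = hashlib.sha256()

    with open(file_path, 'rb') as file:
        while True:
            # Reading is buffered, so we can read smaller chunks.
            chunk = file.read(h.block_size)
            if not chunk:
                break
            h.update(chunk)

    return h.hexdigest()

All the other answers here seem to complicate things too much. Python is already buffering when reading (in an ideal manner, or you can configure that buffering if you have more information about the underlying storage), so it is better to read in chunks of the size the hash function finds ideal, which makes computing the hash faster or at least less CPU intensive. So instead of disabling buffering and trying to emulate it yourself, you use Python's buffering and control what you should be controlling: what the consumer of your data finds ideal, the hash block size.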
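To put rough numbers on the sizes involved, the `block_size` attribute of a few hash objects can be printed next to Python's default buffer size. This is only an illustrative sketch; the choice of algorithms is arbitrary.

```python
import hashlib
import io

# block_size is the internal block size of the hash algorithm, in bytes.
for name in ("md5", "sha1", "sha256", "sha512"):
    print(name, hashlib.new(name).block_size)

# Fallback buffer size for buffered I/O. open() may pick a different size
# based on the underlying device, and the value can differ between Python
# versions, so it is only printed here.
print("io.DEFAULT_BUFFER_SIZE:", io.DEFAULT_BUFFER_SIZE)
```

With SHA-256 the block size is 64 bytes, so each `file.read(h.block_size)` call is normally served from Python's much larger internal buffer rather than from a separate system call.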

小开

Here is a Python 3, POSIX solution (not Windows!) that uses mmap to map the file into memory.

import hashlib
import mmap


def sha256sum(filename):
    h  = hashlib.sha256()
    with open(filename, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ) as mm:
            h.update(mm)
    return h.hexdigest()
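If Windows support matters, one possible variation is to request read-only access through the `access=` argument, which `mmap.mmap()` accepts on both Unix and Windows, instead of the Unix-only `prot=`. The sketch below is an assumption-labelled alternative rather than the method above; the name `sha256sum_mmap` is hypothetical, and the empty-file guard is added because an empty file cannot be memory-mapped.

```python
import hashlib
import mmap
import os


def sha256sum_mmap(filename):
    """SHA-256 of a file via a read-only memory map (hypothetical helper)."""
    h = hashlib.sha256()
    # mmap cannot map an empty file, so return the digest of empty input.
    if os.path.getsize(filename) == 0:
        return h.hexdigest()
    with open(filename, "rb") as f:
        # access=ACCESS_READ is accepted on both Unix and Windows,
        # unlike the Unix-only prot= argument.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            h.update(mm)
    return h.hexdigest()
```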

小开

FWIW, I prefer this version, which has the same memory and performance characteristics as maxschlepizig's answer but is more readable IMO:

import hashlib


def sha256sum(filename, bufsize=128 * 1024):
    h = hashlib.sha256()
    buffer = bytearray(bufsize)
    # using a memoryview so that we can slice the buffer without copying it
    buffer_view = memoryview(buffer)
    with open(filename, 'rb', buffering=0) as f:
        while True:
            n = f.readinto(buffer_view)
            if not n:
                break
            h.update(buffer_view[:n])
    return h.hexdigest()

小开

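You do not need to define a function with 5-20 lines of code to do this! Save your time by using the pathlib and hashlib libraries. py_essentials is another solution, but third parties are *****.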

from pathlib import Path
import hashlib


filepath = '/path/to/file'
filebytes = Path(filepath).read_bytes()


filehash_sha1 = hashlib.sha1(filebytes)
filehash_md5 = hashlib.md5(filebytes)


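print(f'MD5: {filehash_md5.hexdigest()}')
print(f'SHA1: {filehash_sha1.hexdigest()}')

I used a few variables here to show the steps; you know how to avoid them.

What do you think about the function below?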

from pathlib import Path
import hashlib




def compute_filehash(filepath: str, hashtype: str) -> str:
    """Computes the requested hash for the given file.

    Args:
        filepath: The path to the file to compute the hash for.
        hashtype: The hash type to compute.

            Available hash types:
                md5, sha1, sha224, sha256, sha384, sha512, sha3_224,
                sha3_256, sha3_384, sha3_512

    Returns:
        A string that represents the hash.

    Raises:
        ValueError: If the hash type is not supported.
    """
    # shake_128 and shake_256 are left out: their hexdigest() requires a
    # length argument, so the call below would raise TypeError for them.
    if hashtype not in ['md5', 'sha1', 'sha224', 'sha256', 'sha384',
                        'sha512', 'sha3_224', 'sha3_256', 'sha3_384',
                        'sha3_512']:
        raise ValueError(f'Hash type {hashtype} is not supported.')

    # Note: read_bytes() loads the whole file into memory.
    return getattr(hashlib, hashtype)(
        Path(filepath).read_bytes()).hexdigest()
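Since `read_bytes()` loads the entire file into memory, a streaming variant may be preferable for large files. The following is a hedged sketch, not the function above: `compute_filehash_streaming` and its parameters are hypothetical names. It uses `hashlib.file_digest()` when available (added in Python 3.11) and otherwise falls back to a plain chunked loop. It assumes `hashtype` names a fixed-length algorithm.

```python
import hashlib


def compute_filehash_streaming(filepath, hashtype="sha256", bufsize=128 * 1024):
    """Hex digest of a file without loading it fully into memory.

    Hypothetical helper; `hashtype` must name a fixed-length algorithm
    known to hashlib (so not shake_128 or shake_256).
    """
    with open(filepath, "rb") as f:
        if hasattr(hashlib, "file_digest"):
            # Python 3.11+: hashlib reads the file object in chunks for us.
            return hashlib.file_digest(f, hashtype).hexdigest()
        # Older versions: fall back to an explicit chunked loop.
        h = hashlib.new(hashtype)
        for chunk in iter(lambda: f.read(bufsize), b""):
            h.update(chunk)
        return h.hexdigest()
```

For example, `compute_filehash_streaming('/path/to/file', 'sha1')` should return the same string as hashing the full file contents with `hashlib.sha1`, while keeping memory use bounded.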