How can I detect whether a file is binary (non-text) in Python?

小开

Are you on Unix? If so, then try:

isBinary = os.system("file -b " + name + " | grep text > /dev/null")

The shell return values are inverted (0 is ok), so if it finds "text" it returns 0, which in Python is a False expression.
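The inverted return value can be made explicit by comparing the exit status with 0. The following is only a sketch: the helper name succeeded and the two stand-in commands are hypothetical, chosen so the example does not depend on file or grep being installed.

```python
import subprocess
import sys


def succeeded(argv):
    """Return True when the command exits with status 0 (shell 'success')."""
    return subprocess.call(argv) == 0


# Stand-in commands (hypothetical): the Python interpreter exiting with 0 or 1.
ok_cmd = [sys.executable, "-c", "import sys; sys.exit(0)"]
fail_cmd = [sys.executable, "-c", "import sys; sys.exit(1)"]

if __name__ == "__main__":
    print(succeeded(ok_cmd), succeeded(fail_cmd))
```

With a helper like this, a grep that finds "text" (exit status 0) maps to True rather than to a falsy 0.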

小开

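Usually you have to guess.

You can treat the file's extension as one clue, if it has one.

You can also recognise known binary formats and ignore those.

Otherwise, look at the proportion of non-printable ASCII bytes in the file and take a guess from that.

You can also try decoding it as UTF-8 and see whether that produces sensible output.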
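The last two ideas can be sketched roughly as follows. The function name looks_binary, the 30% threshold and the choice of "printable" bytes are all assumptions made for illustration, not established rules.

```python
import string

# Bytes treated as "text": printable ASCII plus common whitespace (an assumption).
_PRINTABLE = frozenset(string.printable.encode("ascii"))


def looks_binary(data, threshold=0.30):
    """Guess whether a chunk of bytes is binary; the threshold is arbitrary."""
    if not data:
        return False  # treat empty input as text
    if b"\x00" in data:
        return True  # NUL bytes are rare in ordinary text
    try:
        data.decode("utf-8")
        return False  # valid UTF-8 without NUL bytes: call it text
    except UnicodeDecodeError:
        pass  # not UTF-8; fall back to counting non-printable bytes
    nonprintable = sum(1 for byte in bytearray(data) if byte not in _PRINTABLE)
    return nonprintable / len(data) > threshold
```

Note that a chunk cut in the middle of a multi-byte UTF-8 sequence fails to decode, in which case the guess falls back to the byte ratio.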

小开

Best answer

You can also use the mimetypes module:

import mimetypes
...
mime = mimetypes.guess_type(file)

It's fairly easy to compile a list of binary mime types. For example Apache distributes with a mime.types file that you could parse into a set of lists, binary and text and then check to see if the mime is in your text or binary list.
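As a rough sketch of that idea, the guessed MIME type can be checked against a text prefix and a small allow-list. The helper name is_text_by_mime and the contents of TEXTUAL_APPLICATION_TYPES are assumptions for illustration; a real list would be longer.

```python
import mimetypes

# MIME types outside "text/" that are still textual. This set is an
# illustrative assumption, not an exhaustive list.
TEXTUAL_APPLICATION_TYPES = {"application/json", "application/xml"}


def is_text_by_mime(filename):
    """Return True or False from the guessed MIME type, or None if unknown."""
    mime, encoding = mimetypes.guess_type(filename)
    if mime is None:
        return None  # the extension is unknown, so there is nothing to go on
    if encoding is not None:
        return False  # e.g. gzip-compressed content is binary on disk
    return mime.startswith("text/") or mime in TEXTUAL_APPLICATION_TYPES
```

Keep in mind that guess_type only looks at the file name, never at the file's contents.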

小开

If it helps, many binary types begin with a magic number. Here is a list of file signatures.
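A minimal sketch of a signature check is shown below. The table holds only a handful of well-known signatures, and the names SIGNATURES and sniff_signature are made up for this example.

```python
# A few well-known file signatures; a deliberately small, illustrative table.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"%PDF-": "PDF document",
    b"PK\x03\x04": "ZIP archive",
    b"GIF87a": "GIF image",
    b"GIF89a": "GIF image",
    b"\x7fELF": "ELF executable",
}


def sniff_signature(head):
    """Return a label if the leading bytes match a known signature, else None."""
    for magic, label in SIGNATURES.items():
        if head.startswith(magic):
            return label
    return None


def sniff_file(path):
    """Read the first few bytes of a file and look them up in the table."""
    with open(path, "rb") as f:
        return sniff_signature(f.read(16))
```

A None result only means the format is not in the table; it does not prove the file is text.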

小开

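Here is a suggestion that uses the Unix file command:

import re
import subprocess


def istext(path):
    # Ask file(1) to describe the path; -L follows symlinks.
    output = subprocess.Popen(["file", '-L', path],
                              stdout=subprocess.PIPE).communicate()[0]
    # The output is bytes, so match it with a bytes pattern.
    return re.search(br':.* text', output) is not None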

Example usage:

>>> istext('/etc/motd')
True
>>> istext('/vmlinuz')
False
>>> open('/tmp/japanese').read()
'\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf\xe3\x80\x81\xe3\x81\xbf\xe3\x81\x9a\xe3\x81\x8c\xe3\x82\x81\xe5\xba\xa7\xe3\x81\xae\xe6\x99\x82\xe4\xbb\xa3\xe3\x81\xae\xe5\xb9\x95\xe9\x96\x8b\xe3\x81\x91\xe3\x80\x82\n'
>>> istext('/tmp/japanese') # works on UTF-8
True

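It has the downsides of not being portable to Windows (unless you have something like the file command there), and of having to spawn an external process for each file, which might not be acceptable.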

小开

If you are not on Windows, you can use Python Magic to determine the file type. Then you can check whether it is a text/ mime type.

小开

Try this:

def is_binary(filename):
    """Return true if the given filename is binary.
    @raise EnvironmentError: if the file does not exist or cannot be accessed.
    @attention: found @ http://bytes.com/topic/python/answers/21222-determine-file-type-binary-text on 6/08/2010
    @author: Trent Mick <TrentM@ActiveState.com>
    @author: Jorge Orpinel <jorge@orpinel.com>"""
    fin = open(filename, 'rb')
    try:
        CHUNKSIZE = 1024
        while 1:
            chunk = fin.read(CHUNKSIZE)
            if b'\0' in chunk:  # found null byte
                return True
            if len(chunk) < CHUNKSIZE:
                break  # done
    # Look, Python doesn't need the "except:" here. How clever.
    finally:
        fin.close()

    return False

小开

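I think the best solution is to use the guess_type function. It holds a list of several mimetypes, and you can also add your own types. Here is the script I wrote to solve my problem:

from mimetypes import guess_type
from mimetypes import add_type
from os import listdir


# The methods below belong to a class; the class statement is not shown.
def __init__(self):
    self.__addMimeTypes()


def __addMimeTypes(self):
    # Register extensions that mimetypes does not know as text.
    add_type("text/plain", ".properties")


def __listDir(self, path):
    try:
        return listdir(path)
    except OSError:
        print("The directory {0} could not be accessed".format(path))
        return []


def getTextFiles(self, path):
    asciiFiles = []
    for files in self.__listDir(path):
        # guess_type returns (None, None) for unknown extensions.
        mime = guess_type(files)[0]
        if mime is not None and mime.split("/")[0] == "text":
            asciiFiles.append(files)
    if not asciiFiles:
        print("No text files in directory: {0}".format(path))
    return asciiFiles

It's inside a class, as you can see from the structure of the code. But you can change pretty much anything to fit it into your application. It's quite simple to use. The getTextFiles method returns a list object with all the text files that reside in the directory you pass in the path variable.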

小开

Yet another method, based on the behaviour of file(1):

>>> textchars = bytearray({7,8,9,10,12,13,27} | set(range(0x20, 0x100)) - {0x7f})
>>> is_binary_string = lambda bytes: bool(bytes.translate(None, textchars))

Example:

>>> is_binary_string(open('/usr/bin/python', 'rb').read(1024))
True
>>> is_binary_string(open('/usr/bin/dh_python3', 'rb').read(1024))
False

小开

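I came here looking for exactly the same thing: a comprehensive solution provided by the standard library to detect binary or text. After reviewing the options people suggested, the *nix file command looks to be the best choice (I'm only developing for linux boxen). Some others have posted solutions using file, but they are unnecessarily complicated in my opinion, so here's what I came up with:

import subprocess


def test_file_isbinary(filename):
    # Pass the arguments as a list so that odd file names need no quoting.
    cmd = ["file", "-b", "-e", "soft", filename]
    # check_output returns bytes on Python 3, so compare against bytes.
    if subprocess.check_output(cmd)[:4] in {b'ASCI', b'UTF-'}:
        return False
    return True

It should go without saying, but the code that calls this function should make sure the file can be read before testing it, otherwise this will mistakenly detect the file as binary.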
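One way to do such a pre-check is sketched below. The helper name can_read is hypothetical, and actually opening the file is just one possible way to test readability.

```python
import os


def can_read(path):
    """Return True if path is an existing regular file that opens for reading."""
    if not os.path.isfile(path):
        return False  # missing, or a directory or other non-regular file
    try:
        with open(path, "rb"):
            return True
    except OSError:
        return False  # e.g. permission denied
```

A caller would run this first and only hand readable paths to the detection function.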

小开

A shorter solution, with a UTF-16 warning:

def is_binary(filename):
    """
    Return true if the given filename appears to be binary.
    File is considered to be binary if it contains a NULL byte.
    FIXME: This approach incorrectly reports UTF-16 as binary.
    """
    with open(filename, 'rb') as f:
        for block in f:
            if b'\0' in block:
                return True
    return False

小开

Use the binaryornot library (GitHub).

It is very simple and based on the code found in this Stack Overflow question.

You can actually write this in 2 lines of code, however this package saves you from having to write and thoroughly test those 2 lines of code with all sorts of weird file types, cross-platform.

小开

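If you're using python3 with utf-8, it is straightforward: just open the file in text mode and stop processing if you get a UnicodeDecodeError. Python 3 uses unicode when handling files in text mode (and bytearray in binary mode); if your encoding can't decode arbitrary files, it's quite likely that you will get a UnicodeDecodeError.

Example:

try:
    with open(filename, "r") as f:
        for l in f:
            process_line(l)
except UnicodeDecodeError:
    pass  # Found non-text data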
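The same idea can be wrapped in a small predicate with the encoding spelled out, so the result does not depend on the platform default. This is only a sketch; the name is_decodable and the block size are assumptions.

```python
def is_decodable(filename, encoding="utf-8", blocksize=64 * 1024):
    """Return True if the whole file decodes with the given encoding."""
    try:
        with open(filename, "r", encoding=encoding) as f:
            # Read in blocks so large files are not loaded into memory at once.
            while f.read(blocksize):
                pass
        return True
    except UnicodeDecodeError:
        return False
```

Note that NUL bytes are valid UTF-8, so a file can pass this check and still be treated as binary by other tools, and an encoding such as latin-1 never raises at all.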

小开

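Most programs consider a file to be binary (that is, any file that is not "line-oriented") if it contains a NULL character.

Here is perl's version of pp_fttext() (pp_sys.c) implemented in Python: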

import sys
PY3 = sys.version_info[0] == 3


# A function that takes an integer in the 8-bit range and returns
# a single-character byte object in py3 / a single-character string
# in py2.
#
int2byte = (lambda x: bytes((x,))) if PY3 else chr


_text_characters = (
    b''.join(int2byte(i) for i in range(32, 127)) +
    b'\n\r\t\f\b')


def istextfile(fileobj, blocksize=512):
    """ Uses heuristics to guess whether the given file is text or binary,
    by reading a single block of bytes from the file.
    If more than 30% of the chars in the block are non-text, or there
    are NUL ('\x00') bytes in the block, assume this is a binary file.
    """
    block = fileobj.read(blocksize)
    if b'\x00' in block:
        # Files with null bytes are binary
        return False
    elif not block:
        # An empty file is considered a valid text file
        return True

    # Use translate's 'deletechars' argument to efficiently remove all
    # occurrences of _text_characters from the block
    nontext = block.translate(None, _text_characters)
    return float(len(nontext)) / len(block) <= 0.30

Note also that this code was written to run on both Python 2 and Python 3 without changes.

Source: Perl's "guess if file is text or binary" implemented in Python

小开

A simpler approach is to check whether the file contains a NULL character (\x00) by using the in operator, for example:

b'\x00' in open("foo.bar", 'rb').read()

See the complete example below:

#!/usr/bin/env python3
import argparse
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('file', nargs=1)
    args = parser.parse_args()
    with open(args.file[0], 'rb') as f:
        if b'\x00' in f.read():
            print('The file is binary!')
        else:
            print('The file is not binary!')

Sample usage:

$ ./is_binary.py /etc/hosts
The file is not binary!
$ ./is_binary.py `which which`
The file is binary!

小开

Here's a function that first checks if the file starts with a BOM and if not looks for a zero byte within the initial 8192 bytes:

import codecs


#: BOMs to indicate that a file is a text file even if it contains zero bytes.
_TEXT_BOMS = (
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF8,
)


def is_binary_file(source_path):
    with open(source_path, 'rb') as source_file:
        initial_bytes = source_file.read(8192)
    return not any(initial_bytes.startswith(bom) for bom in _TEXT_BOMS) \
        and b'\0' in initial_bytes

Technically the check for the UTF-8 BOM is unnecessary, because UTF-8 text should not contain zero bytes for all practical purposes. But as it is a very common encoding, it's quicker to check for the BOM at the beginning instead of scanning all 8192 bytes for 0.
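If the BOM is also used to pick a codec for decoding, the order of the checks matters, because codecs.BOM_UTF32_LE begins with the same two bytes as codecs.BOM_UTF16_LE. A minimal sketch, with the helper name bom_encoding chosen for this example:

```python
import codecs

# Check UTF-32 before UTF-16: BOM_UTF32_LE starts with the bytes of BOM_UTF16_LE.
# The codec names are chosen so that decoding consumes the BOM.
_BOM_ENCODINGS = (
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_BE, "utf-16"),
    (codecs.BOM_UTF16_LE, "utf-16"),
)


def bom_encoding(initial_bytes):
    """Return a codec name suggested by a leading BOM, or None if there is none."""
    for bom, name in _BOM_ENCODINGS:
        if initial_bytes.startswith(bom):
            return name
    return None
```

A UTF-16 file without a BOM still goes undetected here, so this complements the zero-byte check rather than replacing it.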

小开

On *NIX:

If you have access to the `file` shell-command, shlex can help make the subprocess module more usable:

from os.path import realpath
from subprocess import check_output
from shlex import quote, split


filepath = realpath('rel/or/abs/path/to/file')
# Lower-case the output of file(1), not the command line; the output is bytes.
assert b'ascii' in check_output(split('file {}'.format(quote(filepath)))).lower()

Or, you could also stick that in a for loop to get output for all files in the current dir:

import os
for afile in [x for x in os.listdir('.') if os.path.isfile(x)]:
    assert b'ascii' in check_output(split('file {}'.format(quote(afile)))).lower()

or for all subdirs:

for curdir, subdirs, filelist in os.walk('.'):
    for afile in filelist:
        # Join with the directory so files in subdirectories are found.
        path = os.path.join(curdir, afile)
        assert b'ascii' in check_output(split('file {}'.format(quote(path)))).lower()

小开

We can use python itself to check whether a file is binary, because reading a binary file in text mode usually fails with a UnicodeDecodeError:

def is_binary(file_name):
    try:
        with open(file_name, 'r') as check_file:  # try to open the file in text mode
            check_file.read()
            return False
    except UnicodeDecodeError:  # decoding failed, so the file is non-text (binary)
        return True

小开

Try using the currently maintained python-magic, which is not the same module as the one in @Kami Kisiel's answer. It does support all platforms including Windows; however, you will need the libmagic binary files. This is explained in the README.

Unlike the mimetypes module, it doesn't use the file's extension and instead inspects the contents of the file.

>>> import magic
>>> magic.from_file("testdata/test.pdf", mime=True)
'application/pdf'
>>> magic.from_file("testdata/test.pdf")
'PDF document, version 1.2'
>>> magic.from_buffer(open("testdata/test.pdf", "rb").read(1024))
'PDF document, version 1.2'

小开

from binaryornot.check import is_binary
is_binary('filename')

Documentation

小开

All of these basic methods were incorporated into a Python library: binaryornot. Install with pip.

From the documentation:

>>> from binaryornot.check import is_binary
>>> is_binary('README.rst')
False
