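How can I detect if a file is binary (non-text) in Python?

In Python, how can I tell whether a file is binary (non-text)?

I am searching through a large set of files in Python and keep getting matches in binary files. This makes the output look incredibly messy.

I know I could use grep -I, but I am doing more with the data than grep allows.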

In the past, I would have just searched for characters greater than 0x7f, but UTF-8 and the like make that impossible on modern systems. Ideally, the solution would be fast.

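Are you on Unix? If so, then try:

isBinary = os.system("file -b " + name + " | grep text > /dev/null")

The shell return values are inverted (0 is OK, so if it finds "text" it returns 0, and in Python that is a False expression). Note that name is passed to the shell unquoted, so this is only safe for trusted file names.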

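Usually you have to guess.

You can look at the extensions as one clue, if the files have them.

You can also recognise known binary formats and ignore those.

Otherwise, see what proportion of non-printable ASCII bytes you have and take a guess from that.

You can also try decoding from UTF-8 and see if that produces sensible output.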
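
The last two suggestions can be combined into one function. This is only a sketch: the name looks_like_text, the 30% threshold and the choice of "suspicious" bytes are assumptions made for illustration, not part of the answer.

```python
import codecs


def looks_like_text(sample, max_suspicious=0.30):
    """Guess whether a bytes sample is text. A heuristic, not a guarantee."""
    if not sample:
        return True  # treat an empty sample as text
    if b"\x00" in sample:
        return False  # NUL bytes almost never occur in 8-bit text
    data = bytearray(sample)
    # Control bytes other than tab, LF and CR always count as suspicious.
    suspicious = sum(1 for b in data
                     if (b < 0x20 and b not in (0x09, 0x0A, 0x0D)) or b == 0x7F)
    try:
        # final=False tolerates a multi-byte character cut off at the end
        # of the sample, but still raises on invalid byte sequences.
        codecs.getincrementaldecoder("utf-8")().decode(sample, final=False)
    except UnicodeDecodeError:
        # Not valid UTF-8, so bytes above 0x7f count as suspicious too.
        suspicious += sum(1 for b in data if b >= 0x80)
    return suspicious / len(sample) <= max_suspicious
```

A caller would typically pass the first few kilobytes of a file opened in 'rb' mode. The threshold is a tuning knob; legacy 8-bit encodings with many accented characters may need a higher value.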

You could also use the mimetypes module:

import mimetypes
...
mime = mimetypes.guess_type(file)[0]  # guess_type returns a (type, encoding) tuple

It's fairly easy to compile a list of binary mime types. For example, Apache ships with a mime.types file that you could parse into two lists, binary and text, and then check whether the mime type is in your text or binary list.
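
As a rough sketch of that idea without parsing Apache's file, the check below treats only text/* types as text. The function name is_text_by_name and the three-valued result are assumptions; a real list would probably also count types such as application/json as text.

```python
import mimetypes


def is_text_by_name(path):
    """Return True or False based on the MIME type guessed from the file
    name, or None when the extension is unknown."""
    mime, encoding = mimetypes.guess_type(path)
    if mime is None:
        return None  # unknown extension: the caller must inspect the content
    if encoding is not None:
        return False  # e.g. "gzip": the bytes on disk are compressed
    return mime.startswith("text/")
```

Keep in mind that this looks only at the name, never at the content, so a misnamed file will be misclassified.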

If it helps, many binary types begin with a magic number. Here is a list of file signatures.
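
A minimal sketch of such a check follows. The tuple holds only a handful of well-known signatures and is far from complete; the names KNOWN_BINARY_SIGNATURES and has_binary_signature are made up for this example.

```python
# A few well-known magic numbers; a real list would be much longer.
KNOWN_BINARY_SIGNATURES = (
    b"\x89PNG\r\n\x1a\n",  # PNG
    b"GIF87a",             # GIF
    b"GIF89a",             # GIF
    b"\xff\xd8\xff",       # JPEG
    b"%PDF-",              # PDF
    b"PK\x03\x04",         # ZIP (also .docx, .jar, ...)
    b"\x1f\x8b",           # gzip
    b"\x7fELF",            # ELF executable
)


def has_binary_signature(header):
    """Return True if the first bytes of a file match a known signature."""
    # bytes.startswith accepts a tuple of prefixes.
    return header.startswith(KNOWN_BINARY_SIGNATURES)
```

Reading the first 8 bytes of each file in 'rb' mode is enough for this list. A miss proves nothing, so this works best as a fast first filter before a content heuristic.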

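Here's a suggestion that uses the Unix file command:

import re
import subprocess


def istext(path):
    # `file -L` follows symlinks; its output looks like "path: ASCII text".
    out = subprocess.Popen(["file", "-L", path],
                           stdout=subprocess.PIPE).stdout.read()
    # The pipe yields bytes on Python 3, so match with a bytes pattern.
    return re.search(br':.* text', out) is not None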

Example usage:

>>> istext('/etc/motd')
True
>>> istext('/vmlinuz')
False
>>> open('/tmp/japanese', 'rb').read()
b'\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf\xe3\x80\x81\xe3\x81\xbf\xe3\x81\x9a\xe3\x81\x8c\xe3\x82\x81\xe5\xba\xa7\xe3\x81\xae\xe6\x99\x82\xe4\xbb\xa3\xe3\x81\xae\xe5\xb9\x95\xe9\x96\x8b\xe3\x81\x91\xe3\x80\x82\n'
>>> istext('/tmp/japanese') # works on UTF-8
True

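It has the downsides of not being portable to Windows (unless you have something like the file command there), and of having to spawn an external process for each file, which might not be acceptable.

If you're not on Windows, you can use Python Magic to determine the file type. Then you can check whether it is a text/ mime type.

Try this: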

def is_binary(filename):
    """Return true if the given filename is binary.
    @raise EnvironmentError: if the file does not exist or cannot be accessed.
    @attention: found @ http://bytes.com/topic/python/answers/21222-determine-file-type-binary-text on 6/08/2010
    @author: Trent Mick <TrentM@ActiveState.com>
    @author: Jorge Orpinel <jorge@orpinel.com>"""
    fin = open(filename, 'rb')
    try:
        CHUNKSIZE = 1024
        while 1:
            chunk = fin.read(CHUNKSIZE)
            if b'\0' in chunk:  # found null byte (bytes literal for Python 3)
                return True
            if len(chunk) < CHUNKSIZE:
                break  # done
    # Note: try/finally works here without an "except:" clause.
    finally:
        fin.close()

    return False

I guess that the best solution is to use the guess_type function. It holds a list with several mimetypes, and you can also include your own types. Here is the script that I wrote to solve my problem:

from mimetypes import add_type, guess_type
from os import listdir


class TextFileFinder(object):  # hypothetical name: the original omits the class header
    def __init__(self):
        self.__addMimeTypes()

    def __addMimeTypes(self):
        add_type("text/plain", ".properties")

    def __listDir(self, path):
        try:
            return listdir(path)
        except OSError:  # listdir raises OSError, not IOError, on Python 2
            print("The directory {0} could not be accessed".format(path))
            return []

    def getTextFiles(self, path):
        asciiFiles = []
        for name in self.__listDir(path):
            mime = guess_type(name)[0]  # None when the extension is unknown
            if mime is not None and mime.split("/")[0] == "text":
                asciiFiles.append(name)
        return asciiFiles

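It is inside a class, as you can see from the structure of the code, but you can change pretty much anything to fit it into your application. It's quite simple to use. The getTextFiles method returns a list of all the text files that reside in the directory you pass in the path variable.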

Yet another method, based on file(1) behavior:

>>> textchars = bytearray({7,8,9,10,12,13,27} | set(range(0x20, 0x100)) - {0x7f})
>>> is_binary_string = lambda bytes: bool(bytes.translate(None, textchars))

Example:

>>> is_binary_string(open('/usr/bin/python', 'rb').read(1024))
True
>>> is_binary_string(open('/usr/bin/dh_python3', 'rb').read(1024))
False

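I came here looking for exactly the same thing: a comprehensive solution provided by the standard library to detect binary or text. After reviewing the options people suggested, the nix file command looks to be the best choice (I'm only developing for linux boxen). Some others have posted solutions using file, but they are unnecessarily complicated in my opinion, so here's what I came up with:

import subprocess


def test_file_isbinary(filename):
    # Arguments go in a list (no quoting needed); check_output returns bytes on Python 3.
    cmd = ["file", "-b", "-e", "soft", filename]
    if subprocess.check_output(cmd)[:4] in {b'ASCI', b'UTF-'}:
        return False
    return True

It should go without saying, but the code that calls this function should make sure the file can be read before testing it, otherwise the file will be mistakenly detected as binary.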
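
One way to do that pre-check is sketched below. The helper name is_readable_file is an assumption, and os.access only reports what the permissions say at the moment of the call, so the file can still become unreadable before it is tested.

```python
import os


def is_readable_file(path):
    """Return True if path is an existing regular file that the current
    user is allowed to read."""
    return os.path.isfile(path) and os.access(path, os.R_OK)
```

A caller could skip, or report separately, every path for which this returns False instead of counting it as binary.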

A shorter solution, with a UTF-16 warning:

def is_binary(filename):
    """
    Return true if the given filename appears to be binary.
    File is considered to be binary if it contains a NULL byte.
    FIXME: This approach incorrectly reports UTF-16 as binary.
    """
    with open(filename, 'rb') as f:
        for block in f:
            if b'\0' in block:
                return True
    return False

Use the binaryornot library (GitHub).

It is very simple and based on the code found in this Stack Overflow question.

You can actually write this in 2 lines of code, however this package saves you from having to write and thoroughly test those 2 lines of code with all sorts of weird file types, cross-platform.

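If you're using Python 3 with UTF-8, it is straightforward: just open the file in text mode and stop processing if you get a UnicodeDecodeError. Python 3 uses unicode when handling files in text mode (and bytearray in binary mode), so if your encoding can't decode arbitrary files, it's quite likely that you will get a UnicodeDecodeError.

Example:

try:
    with open(filename, "r") as f:
        for l in f:
            process_line(l)
except UnicodeDecodeError:
    pass  # Found non-text data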
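
If the lines should only be processed once the whole file is known to decode, the same idea can be turned into a check that runs first. This is a sketch under assumptions: the name can_decode is invented, and the encoding is passed explicitly because the default used by open() in text mode depends on the platform and locale.

```python
def can_decode(path, encoding="utf-8", chunk_size=64 * 1024):
    """Return True if the whole file decodes cleanly with the given encoding."""
    try:
        # newline="" disables newline translation; only decoding matters here.
        with open(path, "r", encoding=encoding, newline="") as f:
            while f.read(chunk_size):
                pass  # read in chunks so large files are not held in memory
        return True
    except UnicodeDecodeError:
        return False
```

Note that NUL bytes are valid UTF-8, so a file full of zero bytes passes this test. Combining it with a NUL check covers that case.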

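Most programs consider a file to be binary (that is, any file that is not "line-oriented") if it contains a NUL character.

Here is perl's version of pp_fttext() (pp_sys.c) implemented in Python: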

import sys
PY3 = sys.version_info[0] == 3


# A function that takes an integer in the 8-bit range and returns
# a single-character byte object in py3 / a single-character string
# in py2.
#
int2byte = (lambda x: bytes((x,))) if PY3 else chr


_text_characters = (
    b''.join(int2byte(i) for i in range(32, 127)) +
    b'\n\r\t\f\b')


def istextfile(fileobj, blocksize=512):
    """ Uses heuristics to guess whether the given file is text or binary,
    by reading a single block of bytes from the file.
    If more than 30% of the chars in the block are non-text, or there
    are NUL ('\x00') bytes in the block, assume this is a binary file.
    """
    block = fileobj.read(blocksize)
    if b'\x00' in block:
        # Files with null bytes are binary
        return False
    elif not block:
        # An empty file is considered a valid text file
        return True

    # Use translate's 'deletechars' argument to efficiently remove all
    # occurrences of _text_characters from the block
    nontext = block.translate(None, _text_characters)
    return float(len(nontext)) / len(block) <= 0.30

Note also that this code was written to run on both Python 2 and Python 3 without changes.

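Source: Perl's "guess if file is text or binary" implemented in Python

A simpler approach is to use the in operator to check whether the file contains a NUL character (\x00), for instance:

b'\x00' in open("foo.bar", 'rb').read()

See the complete example below: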

#!/usr/bin/env python3
import argparse
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('file', nargs=1)
    args = parser.parse_args()
    with open(args.file[0], 'rb') as f:
        if b'\x00' in f.read():
            print('The file is binary!')
        else:
            print('The file is not binary!')

Usage:

$ ./is_binary.py /etc/hosts
The file is not binary!
$ ./is_binary.py `which which`
The file is binary!

Here's a function that first checks if the file starts with a BOM and if not looks for a zero byte within the initial 8192 bytes:

import codecs


#: BOMs to indicate that a file is a text file even if it contains zero bytes.
_TEXT_BOMS = (
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF8,
)


def is_binary_file(source_path):
    with open(source_path, 'rb') as source_file:
        initial_bytes = source_file.read(8192)
    return not any(initial_bytes.startswith(bom) for bom in _TEXT_BOMS) \
           and b'\0' in initial_bytes

Technically the check for the UTF-8 BOM is unnecessary, because UTF-8 text should not contain zero bytes for all practical purposes. But as it is a very common encoding, it's quicker to check for the BOM at the beginning instead of scanning all 8192 bytes for 0.

On *NIX:

If you have access to the file shell-command, shlex can help make the subprocess module more usable:

from os.path import realpath
from subprocess import check_output
from shlex import quote, split


filepath = realpath('rel/or/abs/path/to/file')
# Lowercase the output of `file` (bytes on Python 3), not the command line.
assert b'ascii' in check_output(split('file {}'.format(quote(filepath)))).lower()

Or, you could also stick that in a for-loop to get output for all files in the current dir:

import os
for afile in [x for x in os.listdir('.') if os.path.isfile(x)]:
    assert b'ascii' in check_output(split('file {}'.format(quote(afile)))).lower()

or for all subdirs:

# os.walk yields (dirpath, dirnames, filenames) tuples; it cannot be indexed.
for curdir, _subdirs, filelist in os.walk('.'):
    for afile in filelist:
        path = os.path.join(curdir, afile)
        assert b'ascii' in check_output(split('file {}'.format(quote(path)))).lower()

We can use Python itself to check whether a file is binary, because reading a binary file in text mode usually fails with a UnicodeDecodeError:

def is_binary(file_name):
    try:
        with open(file_name, 'tr') as check_file:  # try open file in text mode
            check_file.read()
            return False
    except UnicodeDecodeError:  # decoding failed, so the file is non-text (binary)
        return True

Try using the currently maintained python-magic, which is not the same module as the one in @Kami Kisiel's answer. It supports all platforms including Windows; however, you will need the libmagic binary files. This is explained in the README.

Unlike the mimetypes module, it doesn't use the file's extension and instead inspects the contents of the file.

>>> import magic
>>> magic.from_file("testdata/test.pdf", mime=True)
'application/pdf'
>>> magic.from_file("testdata/test.pdf")
'PDF document, version 1.2'
>>> magic.from_buffer(open("testdata/test.pdf", "rb").read(1024))
'PDF document, version 1.2'

from binaryornot.check import is_binary
is_binary('filename')

Documentation

All of these basic methods have been incorporated into a Python library: binaryornot. Install with pip.

From the documentation:

>>> from binaryornot.check import is_binary
>>> is_binary('README.rst')
False