Convert UTF-8 with BOM to UTF-8 without BOM in Python

I have two questions. I have a set of files, typically UTF-8 with BOM. I'd like to convert them (ideally in place) to UTF-8 without BOM. It seems like codecs.StreamRecoder(stream, encode, decode, Reader, Writer, errors) would handle this, but I don't really see any good examples of its use. Is this the best way to handle it?

source files:
Tue Jan 17$ file brh-m-157.json
brh-m-157.json: UTF-8 Unicode (with BOM) text

Also, it would be ideal if we could handle different input encodings without explicitly knowing them (ASCII and UTF-16 have been seen). It seems like all of this should be feasible. Is there a solution that can take any known Python encoding and output as UTF-8 without a BOM?

Edit 1: suggested sol'n from below (thanks!):

fp = open('brh-m-157.json','rw')
s = fp.read()
u = s.decode('utf-8-sig')
s = u.encode('utf-8')
print fp.encoding
fp.write(s)

This gives me the following error:

IOError: [Errno 9] Bad file descriptor

Update

It was pointed out in the comments that the mistake is that I opened the file with mode 'rw' instead of 'r+'/'r+b', so I should eventually re-edit my question and remove the solved part.
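For reference, a sketch of the corrected snippet in Python 3 syntax (the sample file is created here just so the demonstration is self-contained):

```python
import codecs
import os
import tempfile

# Create a demonstration file that starts with a UTF-8 BOM.
path = os.path.join(tempfile.mkdtemp(), 'brh-m-157.json')
with open(path, 'wb') as fp:
    fp.write(codecs.BOM_UTF8 + b'{"key": "value"}')

# The fix: read with 'utf-8-sig' (which strips a leading BOM),
# then rewrite the file as plain UTF-8 -- no invalid 'rw' mode.
with open(path, 'r', encoding='utf-8-sig') as fp:
    s = fp.read()
with open(path, 'w', encoding='utf-8') as fp:
    fp.write(s)

with open(path, 'rb') as fp:
    print(fp.read().startswith(codecs.BOM_UTF8))  # False: the BOM is gone
```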


Simply use the "utf-8-sig" codec:

fp = open("file.txt")
s = fp.read()
u = s.decode("utf-8-sig")

That gives you a unicode string without the BOM. You can then use

s = u.encode("utf-8")

to get a normal UTF-8 encoded string back in s. If your files are big, then you should avoid reading them all into memory. The BOM is simply three bytes at the beginning of the file, so you can use this code to strip them out of the file:

import os, sys, codecs

BUFSIZE = 4096
BOMLEN = len(codecs.BOM_UTF8)

path = sys.argv[1]
with open(path, "r+b") as fp:
    chunk = fp.read(BUFSIZE)
    if chunk.startswith(codecs.BOM_UTF8):
        i = 0
        chunk = chunk[BOMLEN:]
        while chunk:
            fp.seek(i)
            fp.write(chunk)
            i += len(chunk)
            fp.seek(BOMLEN, os.SEEK_CUR)
            chunk = fp.read(BUFSIZE)
        fp.seek(-BOMLEN, os.SEEK_CUR)
        fp.truncate()

It opens the file, reads a chunk, and writes it back 3 bytes earlier than where it read it. The file is rewritten in place. An easier solution is to write the shorter file to a new file, as in newtover's answer. That would be simpler, but it uses twice the disk space for a short period.
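A minimal sketch of that copy-to-a-new-file approach (Python 3; the helper name `rewrite_without_bom` and the temp-file handling are my own):

```python
import codecs
import os
import shutil
import tempfile

def rewrite_without_bom(path):
    # Stream through a temporary file, decoding with 'utf-8-sig'
    # (drops a leading BOM) and re-encoding as plain UTF-8.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    with open(path, 'r', encoding='utf-8-sig') as src, \
         open(fd, 'w', encoding='utf-8') as dst:
        shutil.copyfileobj(src, dst)
    os.replace(tmp, path)  # swap the new file into place

# Demonstration on a throwaway file:
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'wb') as f:
    f.write(codecs.BOM_UTF8 + b'hello')
rewrite_without_bom(path)
with open(path, 'rb') as f:
    print(f.read())  # b'hello'
```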

As for guessing the encoding, you can loop through the encodings from most to least specific:

def decode(s):
    for encoding in "utf-8-sig", "utf-16":
        try:
            return s.decode(encoding)
        except UnicodeDecodeError:
            continue
    return s.decode("latin-1")  # will always work

A UTF-16 encoded file won't decode as UTF-8, so we try UTF-8 first. If that fails, we try UTF-16. Finally, we use Latin-1; this will always work since all 256 byte values are legal in Latin-1. You may want to return None instead in this case since it's really a fallback, and your code might want to handle it more carefully (if it can).
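Restating the function so the sketch runs on its own, a quick check against UTF-8-with-BOM and UTF-16 inputs (the sample strings are made up):

```python
import codecs

def decode(s):
    # Try the most specific encodings first; Latin-1 accepts any
    # byte value, so it is the fallback of last resort.
    for encoding in "utf-8-sig", "utf-16":
        try:
            return s.decode(encoding)
        except UnicodeDecodeError:
            continue
    return s.decode("latin-1")

print(decode(codecs.BOM_UTF8 + "héllo".encode("utf-8")))  # héllo
print(decode("héllo".encode("utf-16")))                   # héllo
```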

import codecs
import shutil
import sys


s = sys.stdin.read(3)
if s != codecs.BOM_UTF8:
    sys.stdout.write(s)

shutil.copyfileobj(sys.stdin, sys.stdout)
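That filter is Python 2 code, where sys.stdin yields bytes; under Python 3 the same idea goes through the binary buffers, roughly like this (in-memory streams stand in for sys.stdin.buffer / sys.stdout.buffer here):

```python
import codecs
import io
import shutil

def strip_bom(src, dst):
    # Copy src to dst, dropping a leading UTF-8 BOM if present.
    head = src.read(len(codecs.BOM_UTF8))
    if head != codecs.BOM_UTF8:
        dst.write(head)
    shutil.copyfileobj(src, dst)

src = io.BytesIO(codecs.BOM_UTF8 + b"hello\n")
dst = io.BytesIO()
strip_bom(src, dst)
print(dst.getvalue())  # b'hello\n'
```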

This is my implementation to convert any kind of encoding to UTF-8 without BOM, replacing Windows line endings with the universal format (Python 2; it needs the chardet package):

import codecs
import os

import chardet


def utf8_converter(file_path, universal_endline=True):
    '''
    Convert any type of file to UTF-8 without BOM
    and using universal endline by default.

    Parameters
    ----------
    file_path : string, file path.
    universal_endline : boolean (True),
        by default convert endlines to universal format.
    '''
    # Fix file path
    file_path = os.path.realpath(os.path.expanduser(file_path))

    # Read from file
    file_open = open(file_path)
    raw = file_open.read()
    file_open.close()

    # Decode
    raw = raw.decode(chardet.detect(raw)['encoding'])
    # Remove windows end line
    if universal_endline:
        raw = raw.replace('\r\n', '\n')
    # Encode to UTF-8
    raw = raw.encode('utf8')
    # Remove BOM
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw.replace(codecs.BOM_UTF8, '', 1)

    # Write to file
    file_open = open(file_path, 'w')
    file_open.write(raw)
    file_open.close()
    return 0

You can use codecs.

import codecs
with open("test.txt", 'r') as filehandle:
    content = filehandle.read()
if content[:3] == codecs.BOM_UTF8:
    content = content[3:]
print content.decode("utf-8")

In Python 3 it's quite easy: read the file and rewrite it with utf-8 encoding:

s = open(bom_file, mode='r', encoding='utf-8-sig').read()
open(bom_file, mode='w', encoding='utf-8').write(s)

I found this question because I was having trouble with configparser.ConfigParser().read(fp) when opening files with a UTF-8 BOM header.

For those who are looking for a solution to remove the header so that ConfigParser can open the config file instead of reporting the error File contains no section headers, please open the file like the following:

configparser.ConfigParser().read(config_file_path, encoding="utf-8-sig")

This could save you tons of effort by making it unnecessary to remove the BOM header from the file.
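A runnable illustration of this (the [server] section and file name are made up for the example):

```python
import codecs
import configparser
import os
import tempfile

# Write a config file that starts with a UTF-8 BOM.
path = os.path.join(tempfile.mkdtemp(), 'app.ini')
with open(path, 'wb') as f:
    f.write(codecs.BOM_UTF8 + b'[server]\nport = 8080\n')

# encoding='utf-8-sig' strips the BOM, so the first line is
# recognised as a section header instead of raising an error.
parser = configparser.ConfigParser()
parser.read(path, encoding='utf-8-sig')
print(parser['server']['port'])  # 8080
```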

(I know this sounds unrelated, but hopefully this could help people struggling like me.)

In Python 3 you should add encoding='utf-8-sig':

with open(file_name, mode='a', encoding='utf-8-sig') as csvfile:
    csvfile.writelines(rows)

That's it.