Python: reading from a file and saving to UTF-8

I'm having trouble reading from a file, processing its string, and saving the result to a UTF-8 file.

The code is as follows:

try:
    filehandle = open(filename,"r")
except:
    print("Could not open file " + filename)
    quit()


text = filehandle.read()
filehandle.close()

Then I do some processing on the variable text.

And then:

try:
    writer = open(output,"w")
except:
    print("Could not open file " + output)
    quit()


#data = text.decode("iso 8859-15")
#writer.write(data.encode("UTF-8"))
writer.write(text)
writer.close()

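This writes the output file perfectly, but according to my editor it does so in ISO 8859-15. Since the same editor recognizes the input file (in the variable filename) as UTF-8, I don't know why this happens. As far as my research has shown, the commented lines should solve the problem. However, when I use those lines, the resulting file has gibberish mainly in the special characters, that is, words with tildes, as the text is in Spanish. I would really appreciate any help, as I am stumped...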


In Python 2 you can't do that with the built-in open. Use codecs.

In Python 2, the built-in open function has no encoding parameter: it reads and writes byte strings without decoding or encoding them. To write the file in UTF-8, try this:

import codecs
file = codecs.open('data.txt','w','utf-8')
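
As a minimal sketch of the codecs approach above, the following writes a short Spanish sample through codecs.open and reads it back. The sample text and the temporary path are assumptions made for illustration; they are not from the question.

```python
import codecs
import os
import tempfile

# Hypothetical sample text and file location, used only for illustration.
sample = u"ma\u00f1ana, ni\u00f1o"
path = os.path.join(tempfile.mkdtemp(), "data.txt")

# codecs.open returns a wrapper that encodes text to UTF-8 on write.
with codecs.open(path, "w", "utf-8") as writer:
    writer.write(sample)

# Reading the raw bytes shows what actually landed on disk.
with open(path, "rb") as raw:
    raw_bytes = raw.read()

# Opening with the same encoding decodes the bytes back to text.
with codecs.open(path, "r", "utf-8") as reader:
    round_tripped = reader.read()
```

Here raw_bytes holds the UTF-8 encoding of the text (each n-with-tilde takes two bytes), and round_tripped equals the original string.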

Process text to and from Unicode at the I/O boundaries of your program using open with the encoding parameter. Make sure to use the (hopefully documented) encoding of the file being read. The default encoding varies by OS (specifically, locale.getpreferredencoding(False) is the encoding used), so I recommend always explicitly using the encoding parameter for portability and clarity (Python 3 syntax below):

with open(filename, 'r', encoding='utf8') as f:
    text = f.read()


# process Unicode text


with open(filename, 'w', encoding='utf8') as f:
    f.write(text)

If still using Python 2 or for Python 2/3 compatibility, the io module implements open with the same semantics as Python 3's open and exists in both versions:

import io
with io.open(filename, 'r', encoding='utf8') as f:
    text = f.read()


# process Unicode text


with io.open(filename, 'w', encoding='utf8') as f:
    f.write(text)
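
As a hedged illustration of why the commented-out decode/encode lines in the question could produce gibberish: if the bytes read from the file were already UTF-8 (as the asker's editor suggested), decoding them as ISO 8859-15 turns each accented character into two wrong ones, and re-encoding as UTF-8 preserves the damage. The sample word below is an assumption, not taken from the asker's file.

```python
# Hypothetical sample word containing n-with-tilde (U+00F1).
original = "ma\u00f1ana"

# What a UTF-8 input file would contain on disk.
utf8_bytes = original.encode("utf-8")

# Decoding UTF-8 bytes with the wrong codec maps each byte to its own character.
misread = utf8_bytes.decode("iso8859_15")

# Re-encoding the misread text as UTF-8 bakes the mistake into the output.
double_encoded = misread.encode("utf-8")

# Decoding with the file's real encoding recovers the text.
correct = utf8_bytes.decode("utf-8")

print(repr(misread))
print(repr(correct))
```

The misread string contains two characters (A-with-tilde and a plus-minus sign) where the single n-with-tilde should be, which matches the kind of garbling described in the question.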

You can also get past decoding errors with the code below:

with open(completefilepath, 'r', encoding='utf8', errors='ignore') as file:
    text = file.read()
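
A caveat on the errors parameter, shown as a small sketch: with "ignore", bytes that are not valid UTF-8 are silently dropped, so text can be lost without any warning. The byte string below is a made-up example of Latin-1 data being read as UTF-8.

```python
# Hypothetical input: "cafe" with e-acute encoded as Latin-1, which is not valid UTF-8.
latin1_bytes = "caf\u00e9".encode("latin-1")

# errors="ignore" drops the undecodable byte entirely.
ignored = latin1_bytes.decode("utf-8", errors="ignore")

# errors="replace" keeps a visible marker (U+FFFD) where decoding failed.
replaced = latin1_bytes.decode("utf-8", errors="replace")

# The default (strict) mode raises instead, which exposes the wrong encoding.
try:
    latin1_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(ignored), repr(replaced), strict_failed)
```

If characters go missing after using errors="ignore", the file is probably not UTF-8, and passing its real encoding is likely the better fix.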

The encoding parameter is what does the trick.

my_list = ['1', '2', '3', '4']
with open('test.txt', 'w', encoding='utf8') as file:
    for i in my_list:
        file.write(i + '\n')

If the file is actually encoded as UTF-16, you can try passing utf-16 as the encoding; it might work.

import pandas as pd
data = pd.read_table(filename, encoding='utf-16', delimiter="\t")
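
Rather than guessing, one option is to look for a byte order mark (BOM) at the start of the file. The helper below is a hypothetical sketch, not part of pandas or the standard library. It only distinguishes UTF-8 with a BOM from UTF-16 with a BOM, and it returns None when no BOM is present, which is common for plain UTF-8 files.

```python
import codecs

def guess_encoding_from_bom(raw):
    """Hypothetical helper: guess an encoding name from a leading BOM, if any."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # Caveat: a UTF-32 little-endian BOM also begins with these bytes;
    # this sketch does not try to handle UTF-32.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return None

# Python's "utf-16" codec writes a BOM when encoding.
utf16_sample = "a\tb".encode("utf-16")
utf8_bom_sample = "a\tb".encode("utf-8-sig")
plain_sample = b"a\tb"

print(guess_encoding_from_bom(utf16_sample))
print(guess_encoding_from_bom(utf8_bom_sample))
print(guess_encoding_from_bom(plain_sample))
```

In practice you would read the first few bytes of the file in binary mode, pass them to the helper, and fall back to an explicitly chosen encoding when it returns None.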