How do I fix "UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 29815: character maps to <undefined>"?

I am trying to get a Python 3 program, run from the Spyder IDE, to process a text file full of information. When it tries to read the file, I get the following error:

  File "<ipython-input-13-d81e1333b8cd>", line 77, in <module>
    parser(f)

  File "<ipython-input-13-d81e1333b8cd>", line 18, in parser
    data = infile.read()

  File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 29815: character maps to <undefined>

The program code is as follows:

import os


os.getcwd()


import glob
import re
import sqlite3
import csv


def parser(file):

    # Open a TXT file. Store all articles in a list. Each article is an item
    # of the list. Split articles based on the location of such string as
    # 'Document PRN0000020080617e46h00461'

    articles = []
    with open(file, 'r') as infile:
        data = infile.read()
    start = re.search(r'\n HD\n', data).start()
    for m in re.finditer(r'Document [a-zA-Z0-9]{25}\n', data):
        end = m.end()
        a = data[start:end].strip()
        a = '\n   ' + a
        articles.append(a)
        start = end

    # In each article, find all used Intelligence Indexing field codes. Extract
    # content of each used field code, and write to a CSV file.

    # All field codes (order matters)
    fields = ['HD', 'CR', 'WC', 'PD', 'ET', 'SN', 'SC', 'ED', 'PG', 'LA', 'CY', 'LP',
              'TD', 'CT', 'RF', 'CO', 'IN', 'NS', 'RE', 'IPC', 'IPD', 'PUB', 'AN']

    for a in articles:
        used = [f for f in fields if re.search(r'\n   ' + f + r'\n', a)]
        unused = [[i, f] for i, f in enumerate(fields) if not re.search(r'\n   ' + f + r'\n', a)]
        fields_pos = []
        for f in used:
            f_m = re.search(r'\n   ' + f + r'\n', a)
            f_pos = [f, f_m.start(), f_m.end()]
            fields_pos.append(f_pos)
        obs = []
        n = len(used)
        for i in range(0, n):
            used_f = fields_pos[i][0]
            start = fields_pos[i][2]
            if i < n - 1:
                end = fields_pos[i + 1][1]
            else:
                end = len(a)
            content = a[start:end].strip()
            obs.append(content)
        for f in unused:
            obs.insert(f[0], '')
        obs.insert(0, file.split('/')[-1].split('.')[0])  # insert Company ID, e.g., GVKEY
        # print(obs)
        cur.execute('''INSERT INTO articles
                       (id, hd, cr, wc, pd, et, sn, sc, ed, pg, la, cy, lp, td, ct, rf,
                       co, ina, ns, re, ipc, ipd, pub, an)
                       VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?,
                       ?, ?, ?, ?, ?, ?, ?, ?)''', obs)


# Write to SQLITE
conn = sqlite3.connect('factiva.db')
with conn:
    cur = conn.cursor()
    cur.execute('DROP TABLE IF EXISTS articles')
    # Mirror all field codes except changing 'IN' to 'INA' because it is an invalid name
    cur.execute('''CREATE TABLE articles
                   (nid integer primary key, id text, hd text, cr text, wc text, pd text,
                   et text, sn text, sc text, ed text, pg text, la text, cy text, lp text,
                   td text, ct text, rf text, co text, ina text, ns text, re text, ipc text,
                   ipd text, pub text, an text)''')
    for f in glob.glob('*.txt'):
        print(f)
        parser(f)


# Write to CSV to feed Stata
with open('factiva.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    with conn:
        cur = conn.cursor()
        cur.execute('SELECT * FROM articles WHERE hd IS NOT NULL')
        colname = [desc[0] for desc in cur.description]
        writer.writerow(colname)
        for obs in cur.fetchall():
            writer.writerow(obs)

As you can see at https://en.wikipedia.org/wiki/Windows-1252, the byte 0x9D is not defined in CP1252.
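To make this concrete, here is a minimal sketch using made-up sample text (not the asker's file). It shows that cp1252 rejects the byte 0x9d, while the same byte is an ordinary part of the UTF-8 encoding of a right curly quote. That makes UTF-8 a plausible guess for a file that triggers this error, although it does not prove it.

```python
# Sample text with curly quotes (U+201C, U+201D); made up for illustration.
sample = "He said \u201chello\u201d"
raw = sample.encode("utf-8")

# The right curly quote encodes to the three bytes E2 80 9D in UTF-8.
contains_9d = b"\x9d" in raw

# Decoding those bytes as cp1252 fails because 0x9d has no mapping there.
try:
    raw.decode("cp1252")
    cp1252_failed = False
except UnicodeDecodeError as exc:
    cp1252_failed = True
    failing_byte = raw[exc.start]  # the byte the codec could not map

# Decoding with the encoding the bytes were written in round-trips cleanly.
round_trip = raw.decode("utf-8")
```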

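The "error" is in, for example, the open call: you do not specify an encoding, so Python falls back to the system's locale encoding, which on Windows is typically a legacy code page such as cp1252. In general, it is better to specify the encoding explicitly, especially when the file you read may not have been created on the same machine.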

I recommend also adding an encoding to the open call that writes the CSV. It really is better to be explicit.

I don't know the original file's encoding, but adding encoding='utf-8' to open is usually a good choice (it is the default on Linux and macOS).
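A minimal sketch of that advice, assuming the input really is UTF-8 (which the question does not confirm). The helper names read_text and write_rows are hypothetical, and the sample data is made up; the point is only that both the read and the CSV write name their encoding instead of relying on the platform default.

```python
import csv
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Read a whole text file with an explicit encoding, not the locale default."""
    with open(path, "r", encoding=encoding) as infile:
        return infile.read()


def write_rows(path, rows, encoding="utf-8"):
    """Write rows to a CSV file, again naming the encoding explicitly."""
    with open(path, "w", newline="", encoding=encoding) as csvfile:
        csv.writer(csvfile).writerows(rows)


# Self-check in a temporary directory with made-up data.
with tempfile.TemporaryDirectory() as tmp:
    txt_path = os.path.join(tmp, "sample.txt")
    with open(txt_path, "wb") as f:
        f.write("quote: \u201d".encode("utf-8"))  # contains the byte 0x9d
    text = read_text(txt_path)

    csv_path = os.path.join(tmp, "sample.csv")
    write_rows(csv_path, [["id", "hd"], ["1", text]])
    csv_text = read_text(csv_path)
```

If the file turns out not to be UTF-8, the read still raises UnicodeDecodeError, which is a useful signal that a different encoding needs to be named.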

The approach above did not work for me. Try this instead: add , errors='ignore' to the open call. It worked wonders!
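It may help to see what errors='ignore' actually does before relying on it. The sketch below uses made-up bytes and decodes them directly rather than through open, but the errors argument behaves the same way in both places: 'ignore' silently drops bytes that cannot be decoded, while 'replace' leaves a visible U+FFFD marker.

```python
# Made-up bytes: valid cp1252 text with one undefined byte (0x9d) in the middle.
raw = b"caf\xe9 \x9d ok"

# 'ignore' drops the undecodable byte without any warning.
ignored = raw.decode("cp1252", errors="ignore")

# 'replace' substitutes U+FFFD so the damaged spot stays visible.
replaced = raw.decode("cp1252", errors="replace")

# The default ('strict') raises, which is the error from the question.
try:
    raw.decode("cp1252")
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True
```

Either handler hides the symptom; if the text matters, naming the correct encoding is the safer fix.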

errors='ignore' cured my headache:

How to find the word "coma" in a directory and its subdirectories:

import os
rootdir=('K:\\0\\000.THU.EEG.nedc_tuh_eeg\\000edf.01_tcp_ar\\01_tcp_ar\\')
for folder, dirs, files in os.walk(rootdir):
    for file in files:
        if file.endswith('.txt'):
            fullpath = os.path.join(folder, file)
            with open(fullpath, 'r', errors='ignore') as f:
                for line in f:
                    if "coma" in line:
                        print(fullpath)
                        break

Add an encoding to the open statement, for example:

f=open("filename.txt","r",encoding='utf-8')

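If you don't need to decode the file, you can also try file = open(filename, 'rb'), where 'rb' means read in binary mode. That is enough if, for example, you just want to upload the file to a website.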
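A minimal sketch of the binary-mode approach, with a hypothetical helper name (read_bytes) and a made-up payload. Because no codec is involved, a byte such as 0x9d cannot cause a decode error; the trade-off is that you get bytes back, not str.

```python
import os
import tempfile


def read_bytes(path):
    """Return the file's raw bytes; nothing is decoded, so no decode error can occur."""
    with open(path, "rb") as f:
        return f.read()


# Self-check in a temporary directory with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    payload = b"text \x9d with an odd byte"
    with open(path, "wb") as f:
        f.write(payload)
    data = read_bytes(path)
```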