How to remove non-UTF-8 characters from a text file

I have a bunch of Arabic, English, and Russian files that are encoded in UTF-8. When I try to process these files with a Perl script, I get this error:

Malformed UTF-8 character (fatal)

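Checking the contents of these files manually, I found some strange characters in them. Now I'm looking for a way to automatically remove these characters from the files.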

Is there a way to do this?

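iconv can do it. For example, if the file is actually encoded in cp1252, this converts it to your locale's default encoding (add -t utf-8 to force UTF-8 output):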

iconv -f cp1252 foo.txt

Whatever method you use must read the file byte by byte and understand how UTF-8 builds characters out of bytes. The simplest approach is to use an editor that will read anything but only write valid UTF-8 characters. TextPad is one choice.
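If you would rather see where the bad bytes are before deleting anything, here is a minimal Python sketch of that byte-level check. The function name find_invalid_utf8 is made up for illustration, and it relies on Python's built-in strict UTF-8 decoder rather than a hand-written one, so treat it as one possible approach and not a reference implementation.

```python
def find_invalid_utf8(data: bytes) -> list[tuple[int, bytes]]:
    """Return (offset, bad_bytes) for every run of bytes that is not valid UTF-8.

    Hypothetical helper: it repeatedly calls the strict decoder and uses the
    position reported in UnicodeDecodeError to step over each bad run.
    """
    problems = []
    pos = 0
    while pos < len(data):
        try:
            # Strict decoding raises on the first invalid sequence.
            data[pos:].decode("utf-8")
            break  # the rest of the data is valid
        except UnicodeDecodeError as err:
            start = pos + err.start
            end = pos + err.end
            problems.append((start, data[start:end]))
            pos = end  # resume right after the offending bytes
    return problems
```

Calling find_invalid_utf8(open("file.txt", "rb").read()) gives you the byte offsets to inspect in a hex viewer. Slicing on every error makes this slow for files with a great many bad bytes, which seemed an acceptable trade-off for a diagnostic sketch.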

This command:

iconv -f utf-8 -t utf-8 -c file.txt

prints a cleaned-up copy of your UTF-8 file to standard output, skipping all the invalid characters. Redirect the output to a new file to save it.

-f is the source format
-t the target format
-c skips any invalid sequence
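If iconv is not available, roughly the same effect can be had from Python. This is a sketch under assumptions: strip_invalid_utf8 and clean_file are names I made up, and I am assuming that Python's errors="ignore" handling is close enough to iconv's -c for this purpose (the two may differ on edge cases).

```python
from pathlib import Path


def strip_invalid_utf8(data: bytes) -> bytes:
    """Drop every byte sequence that is not valid UTF-8, keep the rest."""
    # errors="ignore" makes the decoder skip undecodable bytes silently.
    return data.decode("utf-8", errors="ignore").encode("utf-8")


def clean_file(src: str, dst: str) -> None:
    """Write a cleaned copy of src to dst (hypothetical helper; paths are yours)."""
    raw = Path(src).read_bytes()  # read raw bytes, no decoding yet
    Path(dst).write_bytes(strip_invalid_utf8(raw))
```

For example, clean_file("file.txt", "file.clean.txt") leaves the original untouched and writes the cleaned text next to it. This reads the whole file into memory, so very large files would need a chunked variant.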

None of the methods here or in any other similar question worked for me. In the end, what worked was simply opening the file in Sublime Text 2: go to File > Reopen with Encoding > UTF-8, copy the entire content of the file into a new file, and save it.

This may not be the expected solution, but I'm putting it out here in case it helps anyone, since I struggled with this for hours.