How to detect invalid UTF-8 Unicode/binary in a text file

I need to detect corrupted text files that contain invalid (non-ASCII) UTF-8, Unicode, or binary characters.

�>t�ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½w�ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½o��������ï¿ï¿½_��������������������o����������������������￿����ß����������ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½~�ï¿ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½}���������}w��׿��������������������������������������ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½~������������������������������������_������������������������������������������������������������������������������^����ï¿ï¿½s�����������������������������?�������������ï¿ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½w�������������ï¿ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½}����������ï¿ï¿½ï¿½ï¿½ï¿½y����������������ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½o��������������ï
¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½}��

What I have tried:

iconv -f utf-8 -t utf-8 -c file.csv

This converts the file from UTF-8 to UTF-8, with -c telling iconv to skip invalid UTF-8 characters. Yet in the end those illegal characters still get printed. Is there another solution in bash on Linux, or in another language?
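Since other languages are acceptable, here is a minimal Python sketch (not part of the original attempt; the file name and the reporting format are just placeholders) that reports the exact offending bytes instead of silently skipping them, by decoding each line strictly:

def report_invalid_utf8(path):
    # Read raw bytes so nothing gets decoded (or skipped) behind our back.
    with open(path, "rb") as fh:
        for lineno, raw in enumerate(fh, start=1):
            try:
                raw.decode("utf-8", errors="strict")  # raises on invalid byte sequences
            except UnicodeDecodeError as exc:
                print(f"{path}:{lineno}: bad bytes {raw[exc.start:exc.end]!r} at offset {exc.start}")

report_invalid_utf8("file.csv")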


This Perl program should remove all non-ASCII characters:

foreach $file (@ARGV) {
    open(IN, $file);
    open(OUT, "> super-temporary-utf8-replacement-file-which-should-never-be-used-EVER");
    while (<IN>) {
        s/[^[:ascii:]]//g;    # delete every non-ASCII character
        print OUT "$_";
    }
    close(IN);
    close(OUT);               # flush the output before renaming it over the original
    rename "super-temporary-utf8-replacement-file-which-should-never-be-used-EVER", $file;
}

You give it files as input on the command line, like so: perl fixutf8.pl foo bar baz
Then, for each line, it replaces each instance of a non-ASCII character with nothing (deletes it).
It then writes this modified line out to super-temporary-utf8-replacement-file-which-should-never-be-used-EVER (named so that it won't modify any of your other files).
Finally, it renames the temporary file to the original file name. This script accepts all ASCII characters (including DEL, NUL, CR, etc.), in case you have some special use for them. If you want printable characters only, simply replace :ascii: with :print: in the s///. I hope this helps! Please let me know if this isn't what you wanted.

I would grep for non-ASCII characters.

With GNU grep with PCRE (not always available, because of -P; on FreeBSD you can use pcregrep from the pcre2 package), you can do:

grep -P "[\x80-\xFF]" file

Reference in How do I grep for all non-ASCII characters in UNIX. So, actually, if you just want to check whether the file contains non-ASCII characters, you can simply say:

if grep -qP "[\x80-\xFF]" file ; then echo "file contains non-ASCII characters"; fi
#        ^
#        silent grep

To delete these characters, you can use:

sed -i.bak 's/[\d128-\d255]//g' file

This creates a backup as file.bak, while the original file has its non-ASCII characters removed. Reference in Remove non-ascii characters from csv.

What you are looking at is, by definition, corrupted. Apparently you are displaying the file as rendered in Latin-1; the three characters ï¿½ represent the three byte values 0xEF 0xBF 0xBD. But those are the UTF-8 encoding of the Unicode REPLACEMENT CHARACTER U+FFFD, which is the result of an attempt to convert bytes from an unknown or undefined encoding into UTF-8, and which would properly be displayed as � (if you have a browser from this century, you should see something like a black diamond with a question mark; but this also depends on the font you are using, etc.).

So your question about "how to detect" this particular phenomenon is easy: the Unicode character U+FFFD is a dead giveaway, and the only possible symptom of the process you are implying.
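Given that, a minimal detection sketch in Python, assuming the file decodes as UTF-8 and with the file name as a placeholder: count the U+FFFD replacement characters per line.

with open("file.csv", encoding="utf-8") as fh:
    for lineno, line in enumerate(fh, start=1):
        hits = line.count("\ufffd")  # U+FFFD REPLACEMENT CHARACTER
        if hits:
            print(f"line {lineno}: {hits} replacement character(s)")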

These are not "invalid Unicode" or "invalid UTF-8" in the sense that this is a valid UTF-8 sequence encoding a valid Unicode character; it is just that the semantics of this particular code point are "this is a replacement character for a character which could not be represented properly", i.e. invalid input.

As for how to prevent it in the first place, the answer is really simple, but also rather uninformative: you need to identify when and how the incorrect encoding happened, and fix the process that produced the invalid output.

To remove the U+FFFD characters, try something like

perl -CSD -pe 's/\x{FFFD}//g' file

but again, the correct solution is to not produce these erroneous outputs in the first place.

(You are not revealing the encoding of your example data. It is possible that it has an additional layer of corruption. If what you are showing us is a copy/paste of the UTF-8 rendering of the data, it has been "doubly encoded". In other words, somebody took already corrupted UTF-8 text, as described above, and told the computer to convert it from Latin-1 to UTF-8. Undoing that is easy; just convert it "back" to Latin-1. What you obtain should then be the original UTF-8 data from before the superfluous incorrect conversion.)
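As a rough illustration of that "convert it back to Latin-1" step, a minimal Python sketch, assuming every character in the doubly encoded text fits in Latin-1 (file names are placeholders):

with open("doubly_encoded.txt", encoding="utf-8") as fh:
    mangled = fh.read()                      # text that was wrongly converted Latin-1 -> UTF-8
original_bytes = mangled.encode("latin-1")   # undo it: back to the original byte values
print(original_bytes.decode("utf-8"))        # which now decode as the original UTF-8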

I am probably repeating what others have already said. But I think your invalid characters are still being printed because they may well be valid. The Universal Character Set is an attempt to catalogue the characters frequently used worldwide, in order to make it possible to write robust software that does not rely on one special character set.

So I think your problem may be one of the following two, assuming your overall goal is to handle this (malicious) input from UTF files:

  1. There are invalid UTF-8 characters (better called invalid byte sequences; for this I would refer to the corresponding Wikipedia article).
  2. There are characters that have no equivalent in your current display font; these get substituted by a special symbol or shown as their binary ASCII equivalent (hence I would refer to this post: UTF-8 special characters don't show up).

So, in my opinion, you have two possible ways to handle this:

  1. Transform all characters from UTF-8 into something handleable, e.g. ASCII; this can be done, for example, with iconv -f utf-8 -t ascii -o file_in_ascii.txt file_in_utf8.txt (a Python sketch of the same idea follows this list). But be careful when moving from a wider character space (UTF) to a smaller one, as this may cause data loss.
  2. Handle UTF(-8) correctly; this is how the world writes things. If you think you must rely on ASCII characters because of some restrictive post-processing step, stop and reconsider. In most cases the post-processor already supports UTF, and it is better to find out how to make use of it. This makes your work future-proof and bulletproof.
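As a rough sketch of option 1 (referenced in that item), the same lossy ASCII-fication can be done in Python; this is only an illustration under that assumption, not a full replacement for a real transliteration step:

import unicodedata

def to_ascii(text):
    decomposed = unicodedata.normalize("NFKD", text)              # é -> e + combining accent
    return decomposed.encode("ascii", "ignore").decode("ascii")   # drop what has no ASCII form

print(to_ascii("Pour être ou ne pas être"))  # -> "Pour etre ou ne pas etre"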

Handling UTF may seem tricky; the following steps can help you become UTF-ready:

  • Be able to display UTF correctly, or make sure your display stack (OS, terminal, and so on) can display an adequate subset of Unicode (one that, of course, covers your needs); in many cases this removes the need for a hex editor. Unfortunately UTF is too big to fit in a single font, but a good starting point is this SO post: https://stackoverflow.com/questions/586503/complete-monospaced-unicode-font
  • Be able to filter out invalid byte sequences. There are many ways to achieve this; this U&L post shows plenty of different approaches: Filtering invalid utf8. I want to point out the fourth answer in particular, which suggests uconv, which lets you set a callback handler for invalid sequences. (A small filtering sketch follows this list.)
  • Read a bit more about Unicode.
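For the filtering bullet above, a minimal Python sketch (an alternative to uconv, with placeholder file names): decode leniently and write the cleaned text back out.

with open("file_in.txt", "rb") as src:
    # errors="ignore" silently drops invalid byte sequences;
    # errors="replace" would turn them into U+FFFD instead.
    cleaned = src.read().decode("utf-8", errors="ignore")

with open("file_out.txt", "w", encoding="utf-8") as dst:
    dst.write(cleaned)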

A very dirty solution in Python 3

with open("cur.txt", "r", encoding="utf-8") as f:
    for line in f:
        for c in line:
            if ord(c) < 128:        # keep ASCII characters only
                print(c, end="")

The output should be:

>two_o~}}w~_^s?w}yo}

Assuming your locale is set to UTF-8 (see the output of locale), this works very well for recognizing invalid UTF-8 sequences:

grep -axv '.*' file.txt

Explanation (from the grep man page):

  • -a, --text: treats the file as text; essential to prevent grep from aborting when it finds an invalid byte sequence (i.e. not UTF-8)
  • -v, --invert-match: inverts the output, showing the lines that do not match
  • -x '.*' (--line-regexp): means to match a complete line consisting entirely of (any) UTF-8 characters.

Hence, the output consists exactly of the lines that contain invalid non-UTF-8 byte sequences (because the match is inverted with -v).
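If you only need a yes/no answer (is the whole file valid UTF-8?), the same idea reduces to a single strict decode; a minimal Python sketch with a placeholder file name, not a drop-in for the grep invocation above:

import sys

def is_valid_utf8(path):
    with open(path, "rb") as fh:
        try:
            fh.read().decode("utf-8")   # raises on the first invalid byte sequence
            return True
        except UnicodeDecodeError:
            return False

sys.exit(0 if is_valid_utf8("file.txt") else 1)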

The following C program detects invalid UTF-8 characters. It was tested and used on a Linux system.

/*
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.
*/

#include <stdio.h>
#include <stdlib.h>

void usage( void ) {
    printf( "Usage: test_utf8 file ...\n" );

    return;
}

int line_number = 1;
int char_number = 1;
char *file_name = NULL;

void inv_char( void ) {
    printf( "%s: line : %d - char %d\n", file_name, line_number, char_number );

    return;
}

int main( int argc, char *argv[]) {

    FILE *out = NULL;
    FILE *fh = NULL;

//    printf( "argc: %d\n", argc );

    if( argc < 2 ) {
        usage();
        exit( 1 );
    }

//    printf( "File: %s\n", argv[1] );

    file_name = argv[1];

    fh = fopen( file_name, "rb" );
    if( ! fh ) {
        printf( "Could not open file '%s'\n", file_name );
        exit( 1 );
    }

    int utf8_type = 1;
    int utf8_1 = 0;
    int utf8_2 = 0;
    int utf8_3 = 0;
    int utf8_4 = 0;
    int byte_count = 0;
    int expected_byte_count = 0;

    int cin = fgetc( fh );
    while( ! feof( fh ) ) {
        switch( utf8_type ) {
            case 1:
                if( (cin & 0x80) ) {
                    if( (cin & 0xe0) == 0xc0 ) {
                        utf8_1 = cin;
                        utf8_type = 2;
                        byte_count = 1;
                        expected_byte_count = 2;
                        break;
                    }

                    if( (cin & 0xf0) == 0xe0 ) {
                        utf8_1 = cin;
                        utf8_type = 2;
                        byte_count = 1;
                        expected_byte_count = 3;
                        break;
                    }

                    if( (cin & 0xf8) == 0xf0 ) {
                        utf8_1 = cin;
                        utf8_type = 2;
                        byte_count = 1;
                        expected_byte_count = 4;
                        break;
                    }

                    inv_char();
                    utf8_type = 1;
                    break;
                }

                break;

            case 2:
            case 3:
            case 4:
//                printf( "utf8_type - %d\n", utf8_type );
//                printf( "%c - %02x\n", cin, cin );
                if( (cin & 0xc0) == 0x80 ) {
                    if( utf8_type == expected_byte_count ) {
                        utf8_type = 1;
                        break;
                    }

                    byte_count = utf8_type;
                    utf8_type++;

                    if( utf8_type == 5 ) {
                        utf8_type = 1;
                    }

                    break;
                }

                inv_char();
                utf8_type = 1;
                break;

            default:
                inv_char();
                utf8_type = 1;
                break;
        }

        if( cin == '\n' ) {
            line_number++;
            char_number = 0;
        }

        if( out != NULL ) {
            fputc( cin, out );
        }

//        printf( "lno: %d\n", line_number );

        cin = fgetc( fh );
        char_number++;
    }

    fclose( fh );

    return 0;
}

Try this, to find non-ASCII characters from the shell.

Command:

$ perl -ne 'print "$. $_" if m/[\x80-\xFF]/'  utf8.txt

Output:

2 Pour être ou ne pas être
4 Byť či nebyť
5 是或不

... I was trying to detect whether a file has corrupted characters. I am also interested in removing them.

This is easy with ugrep and takes just one line:

ugrep -q -e "." -N "\p{Unicode}" file.csv && echo "file is corrupted"

To remove invalid Unicode characters:

ugrep "\p{Unicode}" --format="%o" file.csv

The first command matches any character with -e "." except valid Unicode characters matched with -N "\p{Unicode}", which is a "negative pattern" to skip.

The second command matches a Unicode character "\p{Unicode}" and writes it out with --format="%o".