How can I replace non-printable Unicode characters in Java?

The following code replaces ASCII control characters (\p{Cntrl} is shorthand for [\x00-\x1F\x7F]):

my_string.replaceAll("\\p{Cntrl}", "?");

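The following replaces everything outside the ASCII printable range (\p{Print} is shorthand for [\p{Graph}\x20]), which also strips accented characters: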

my_string.replaceAll("[^\\p{Print}]", "?");

However, neither approach works for Unicode strings. Does anyone have a good way to remove non-printable characters from a Unicode string?
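
To make the limitation concrete, here is a minimal sketch. The class name AsciiClassDemo and the sample string are made up for illustration, and it assumes the default regex mode, in which \p{Cntrl} and \p{Print} are the ASCII-only POSIX classes:

```java
final class AsciiClassDemo {
    // U+00E9 is a printable accented letter; U+0085 (NEL) is a non-ASCII control character.
    static final String SAMPLE = "caf\u00E9\u0085";

    static String replaceAsciiControls(String s) {
        // \p{Cntrl} only covers [\x00-\x1F\x7F], so U+0085 is left alone
        return s.replaceAll("\\p{Cntrl}", "?");
    }

    static String replaceAsciiNonPrintable(String s) {
        // \p{Print} is ASCII-only, so the accented letter is replaced as well
        return s.replaceAll("[^\\p{Print}]", "?");
    }

    public static void main(String[] args) {
        System.out.println(replaceAsciiControls(SAMPLE).equals(SAMPLE)); // true
        System.out.println(replaceAsciiNonPrintable(SAMPLE));            // caf??
    }
}
```

The first pattern misses the non-ASCII control character, and the second removes too much.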


You may be interested in the Unicode categories "Other, Control" and possibly "Other, Format" (unfortunately the latter seems to contain both unprintable and printable characters).

In Java regular expressions you can check for them using \p{Cc} and \p{Cf} respectively.

my_string.replaceAll("\\p{C}", "?");

See more about Unicode regex. java.util.regex.Pattern and String.replaceAll support these categories.
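
As a rough illustration of how these categories differ, the following sketch runs each pattern over a string containing one control character and one format character. The class name, helper method, and sample input are assumptions made for the example:

```java
final class UnicodeCategoryDemo {
    // U+0085 (NEL) is in category Cc; U+200B (zero width space) is in category Cf.
    static final String SAMPLE = "a\u0085b\u200Bc";

    static String replaceCategory(String s, String category) {
        // Builds a pattern such as \p{Cc} and replaces every match with '?'
        return s.replaceAll("\\p{" + category + "}", "?");
    }

    public static void main(String[] args) {
        System.out.println(replaceCategory(SAMPLE, "Cc").equals("a?b\u200Bc")); // true
        System.out.println(replaceCategory(SAMPLE, "Cf").equals("a\u0085b?c")); // true
        System.out.println(replaceCategory(SAMPLE, "C"));                       // a?b?c
    }
}
```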

Op De Cirkel is mostly right. His suggestion will work in most cases:

myString.replaceAll("\\p{C}", "?");

But if myString might contain non-BMP codepoints then it's more complicated. \p{C} contains the surrogate codepoints of \p{Cs}. The replacement method above will corrupt non-BMP codepoints by sometimes replacing only half of the surrogate pair. It's possible this is a Java bug rather than intended behavior.

Using the other constituent categories is an option:

myString.replaceAll("[\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}]", "?");

However, solitary surrogate characters not part of a pair (each surrogate character has an assigned codepoint) will not be removed. A non-regex approach is the only way I know to properly handle \p{C}:

StringBuilder newString = new StringBuilder(myString.length());
for (int offset = 0; offset < myString.length();)
{
    int codePoint = myString.codePointAt(offset);
    offset += Character.charCount(codePoint);

    // Replace invisible control characters and unused code points
    switch (Character.getType(codePoint))
    {
        case Character.CONTROL:     // \p{Cc}
        case Character.FORMAT:      // \p{Cf}
        case Character.PRIVATE_USE: // \p{Co}
        case Character.SURROGATE:   // \p{Cs}
        case Character.UNASSIGNED:  // \p{Cn}
            newString.append('?');
            break;
        default:
            newString.append(Character.toChars(codePoint));
            break;
    }
}
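
For reference, the same loop can be sketched with String.codePoints() (Java 8 or later). The class and method names here are illustrative only; the category checks mirror the switch above:

```java
final class CodePointCleaner {
    /** Replaces every code point in the Unicode "Other" (C) categories with '?'. */
    static String replaceOtherCategory(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            switch (Character.getType(cp)) {
                case Character.CONTROL:     // \p{Cc}
                case Character.FORMAT:      // \p{Cf}
                case Character.PRIVATE_USE: // \p{Co}
                case Character.SURROGATE:   // \p{Cs}, only reached for unpaired surrogates
                case Character.UNASSIGNED:  // \p{Cn}
                    out.append('?');
                    break;
                default:
                    out.appendCodePoint(cp); // writes a surrogate pair for non-BMP code points
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1D11E (written as the pair D834 DD1E) is kept; the lone surrogate and NUL are replaced.
        String cleaned = replaceOtherCategory("a\uD834\uDD1Eb\uD800c\u0000");
        System.out.println(cleaned.equals("a\uD834\uDD1Eb?c?")); // true
    }
}
```

Because codePoints() combines valid surrogate pairs before the category check, a supplementary character is never split in half, while an unpaired surrogate still shows up as its own code point and is replaced.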

I have reworked the code from "Extract digits from a string in Java" to handle phone numbers such as +9 (987) 124124:

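import java.nio.CharBuffer;

// Keeps characters in the range '(' (40) to '9' (57): the digits plus ( ) * + , - . /
public static String stripNonDigitsV2(CharSequence input) {
    if (input == null)
        return null;
    if (input.length() == 0)
        return "";

    char[] result = new char[input.length()];
    int cursor = 0;
    CharBuffer buffer = CharBuffer.wrap(input);
    int i = 0;
    while (i < buffer.length()) {
        char chr = buffer.get(i);
        if (chr == 'u') {
            // Skip what is assumed to be a literal "uXXXX" escape sequence
            i = i + 5;
            if (i >= buffer.length())
                break; // the escape ran past the end of the input
            chr = buffer.get(i);
        }

        if (chr > 39 && chr < 58)
            result[cursor++] = chr;
        i = i + 1;
    }

    return new String(result, 0, cursor);
}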

The methods below may serve your goal:

public static String removeNonAscii(String str)
{
    return str.replaceAll("[^\\x00-\\x7F]", "");
}

public static String removeNonPrintable(String str) // all of category C: control, format, surrogate, private use, unassigned
{
    return str.replaceAll("[\\p{C}]", "");
}

public static String removeSomeControlChar(String str) // category C without the surrogates
{
    return str.replaceAll("[\\p{Cntrl}\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}]", "");
}

public static String removeFullControlChar(String str)
{
    // CR, LF and tab are already in \p{Cc}; the second pass is only an explicit safeguard
    return removeNonPrintable(str).replaceAll("[\\r\\n\\t]", "");
}

I have used this simple function for this:

import java.util.regex.Pattern;

// Matches anything outside the printable ASCII range (space through tilde)
private static final Pattern pattern = Pattern.compile("[^ -~]");

private static String cleanTheText(String text) {
    // Remove every match, not only occurrences of the first character found
    return pattern.matcher(text).replaceAll("");
}

Hope this is useful.

Based on the answers by Op De Cirkel and noackjr, the following is what I do for general string cleaning: 1. trim leading and trailing whitespace, 2. dos2unix, 3. mac2unix, 4. remove all "invisible Unicode characters" except whitespace:

myString.trim().replaceAll("\r\n", "\n").replaceAll("\r", "\n").replaceAll("[\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}&&[^\\s]]", "")

Tested with Scala REPL.
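
For Java callers, the four steps above can be sketched as a single method. The class and method names are placeholders, the regex is the one from the one-liner, and String.replace is used for the two literal line-ending substitutions since they need no regex:

```java
final class StringCleaner {
    static String clean(String s) {
        return s.trim()                       // 1. strip leading/trailing whitespace
                .replace("\r\n", "\n")        // 2. dos2unix: CRLF becomes LF
                .replace("\r", "\n")          // 3. mac2unix: a lone CR becomes LF
                // 4. drop control, format, private-use and unassigned characters,
                //    except those that count as whitespace (\s)
                .replaceAll("[\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}&&[^\\s]]", "");
    }

    public static void main(String[] args) {
        String raw = "  line1\r\nline2\rcol\u200B1\tcol\u00002 ";
        System.out.println(clean(raw).equals("line1\nline2\ncol1\tcol2")); // true
    }
}
```

Interior tabs and newlines survive step 4 because they match \s, while the zero width space and the NUL character do not.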

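I propose handling the string code point by code point, as below. As written, this targets supplementary (non-BMP) code points and substitutes "?" for each one; omit the append in the first branch to remove them instead of replacing them: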

private String removeNonBMPCharacters(final String input) {
    StringBuilder strBuilder = new StringBuilder();
    input.codePoints().forEach((i) -> {
        if (Character.isSupplementaryCodePoint(i)) {
            strBuilder.append("?");
        } else {
            strBuilder.append(Character.toChars(i));
        }
    });
    return strBuilder.toString();
}
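
If the goal is to drop supplementary code points rather than substitute a placeholder, a stream-based sketch could look like this. The class and method names are hypothetical, and it requires Java 8 or later:

```java
final class SupplementaryFilter {
    static String dropSupplementary(String input) {
        return input.codePoints()
                .filter(cp -> !Character.isSupplementaryCodePoint(cp)) // keep BMP code points only
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        // U+1D11E (musical G clef) is encoded as the surrogate pair D834 DD1E.
        System.out.println(dropSupplementary("a\uD834\uDD1Eb")); // ab
    }
}
```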

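A variant with a multilanguage flag. When the flag is set, characters up to \xFF (Latin-1) are kept rather than ASCII only: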

public static String cleanUnprintableChars(String text, boolean multilanguage)
{
    // keeps only ASCII, or Latin-1 (up to \xFF) when multilanguage is set
    String regex = multilanguage ? "[^\\x00-\\xFF]" : "[^\\x00-\\x7F]";
    text = text.replaceAll(regex, "");

    // erases the ASCII control characters except CR, LF and tab
    text = text.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "");

    // removes the remaining Unicode "Other" characters, again sparing CR, LF and tab
    // (a plain \p{C} would delete them, since they belong to category Cc)
    text = text.replaceAll("[\\p{C}&&[^\r\n\t]]", "");

    return text.trim();
}