一次在一个字符串中替换多个不同的子字符串(或者以最有效的方式)

我需要以最有效的方式在一个字符串中替换许多不同的子字符串。 除了使用 string.place 替换每个字段这种蛮力方式之外,还有其他方法吗?

132619 次浏览

How about using the Replace All () method?

如果要多次更改 String,那么使用 StringBuilder (but measure your performance to find out)通常更有效:

String str = "The rain in Spain falls mainly on the plain";
StringBuilder sb = new StringBuilder(str);
// do your replacing in sb - although you'll find this trickier than simply using String
String newStr = sb.toString();

每次对 String 进行替换时,都会创建一个新的 String 对象,因为 String 是不可变的。StringBuilder 是可变的,也就是说,它可以随意更改。

StringBuilder将更有效地执行替换操作,因为它的字符数组缓冲区可以指定到所需的长度!

当然,真正的问题是,这种优化是否过头了?JVM 非常善于处理多个对象的创建和随后的垃圾收集,就像所有优化问题一样,我的第一个问题是您是否已经度量了这个问题并确定它是一个问题。

如果要操作的字符串很长,或者要操作很多字符串,那么使用 java.util.regex 是值得的。Matcher (这需要预先编译的时间,所以如果输入很小或者搜索模式经常变化,那么它就不会有效率)。

Below is a full example, based on a list of tokens taken from a map. (Uses StringUtils from Apache Commons Lang).

Map<String,String> tokens = new HashMap<String,String>();
tokens.put("cat", "Garfield");
tokens.put("beverage", "coffee");


String template = "%cat% really needs some %beverage%.";


// Create pattern of the format "%(cat|beverage)%"
String patternString = "%(" + StringUtils.join(tokens.keySet(), "|") + ")%";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(template);


StringBuffer sb = new StringBuffer();
while(matcher.find()) {
matcher.appendReplacement(sb, tokens.get(matcher.group(1)));
}
matcher.appendTail(sb);


System.out.println(sb.toString());

一旦正则表达式被编译,扫描输入字符串通常非常快(尽管如果您的正则表达式很复杂或涉及回溯,那么您仍然需要进行基准测试以确认这一点!)

看看这个:

String.format(str,STR[])

例如:

String.format( "Put your %s where your %s is", "money", "mouth" );

Rythm a java template engine now released with an new feature called 一个 href = “ http://www.playframework.org/module/rythm-1.0.0-20120630/String _ interputation”rel = “ nofollow”> String 插值模式 which allows you do something like:

String result = Rythm.render("@name is inviting you", "Diana");

上面的例子说明你可以通过位置将参数传递给模板。Rythm 还允许按名称传递参数:

Map<String, Object> args = new HashMap<String, Object>();
args.put("title", "Mr.");
args.put("name", "John");
String result = Rythm.render("Hello @title @name", args);

注意 Rythm 是 VERY FAST,比 String.format 和 velocity 快2到3倍,因为它将模板编译成 java 字节码,运行时性能非常接近于 StringBuilder 的串联。

相关网址:

public String replace(String input, Map<String, String> pairs) {
// Reverse lexic-order of keys is good enough for most cases,
// as it puts longer words before their prefixes ("tool" before "too").
// However, there are corner cases, which this algorithm doesn't handle
// no matter what order of keys you choose, eg. it fails to match "edit"
// before "bed" in "..bedit.." because "bed" appears first in the input,
// but "edit" may be the desired longer match. Depends which you prefer.
final Map<String, String> sorted =
new TreeMap<String, String>(Collections.reverseOrder());
sorted.putAll(pairs);
final String[] keys = sorted.keySet().toArray(new String[sorted.size()]);
final String[] vals = sorted.values().toArray(new String[sorted.size()]);
final int lo = 0, hi = input.length();
final StringBuilder result = new StringBuilder();
int s = lo;
for (int i = s; i < hi; i++) {
for (int p = 0; p < keys.length; p++) {
if (input.regionMatches(i, keys[p], 0, keys[p].length())) {
/* TODO: check for "edit", if this is "bed" in "..bedit.." case,
* i.e. look ahead for all prioritized/longer keys starting within
* the current match region; iff found, then ignore match ("bed")
* and continue search (find "edit" later), else handle match. */
// if (better-match-overlaps-right-ahead)
//   continue;
result.append(input, s, i).append(vals[p]);
i += keys[p].length();
s = i--;
}
}
}
if (s == lo) // no matches? no changes!
return input;
return result.append(input, s, hi).toString();
}

下面是基于 Todd Owen 的回答。这种解决方案存在一个问题: 如果替身计画包含在正则表达式中具有特殊意义的字符,就会得到意想不到的结果。我还希望能够有选择地进行不区分大小写的搜索。以下是我的想法:

/**
* Performs simultaneous search/replace of multiple strings. Case Sensitive!
*/
public String replaceMultiple(String target, Map<String, String> replacements) {
return replaceMultiple(target, replacements, true);
}


/**
* Performs simultaneous search/replace of multiple strings.
*
* @param target        string to perform replacements on.
* @param replacements  map where key represents value to search for, and value represents replacem
* @param caseSensitive whether or not the search is case-sensitive.
* @return replaced string
*/
public String replaceMultiple(String target, Map<String, String> replacements, boolean caseSensitive) {
if(target == null || "".equals(target) || replacements == null || replacements.size() == 0)
return target;


//if we are doing case-insensitive replacements, we need to make the map case-insensitive--make a new map with all-lower-case keys
if(!caseSensitive) {
Map<String, String> altReplacements = new HashMap<String, String>(replacements.size());
for(String key : replacements.keySet())
altReplacements.put(key.toLowerCase(), replacements.get(key));


replacements = altReplacements;
}


StringBuilder patternString = new StringBuilder();
if(!caseSensitive)
patternString.append("(?i)");


patternString.append('(');
boolean first = true;
for(String key : replacements.keySet()) {
if(first)
first = false;
else
patternString.append('|');


patternString.append(Pattern.quote(key));
}
patternString.append(')');


Pattern pattern = Pattern.compile(patternString.toString());
Matcher matcher = pattern.matcher(target);


StringBuffer res = new StringBuffer();
while(matcher.find()) {
String match = matcher.group(1);
if(!caseSensitive)
match = match.toLowerCase();
matcher.appendReplacement(res, replacements.get(match));
}
matcher.appendTail(res);


return res.toString();
}

下面是我的单元测试用例:

@Test
public void replaceMultipleTest() {
assertNull(ExtStringUtils.replaceMultiple(null, null));
assertNull(ExtStringUtils.replaceMultiple(null, Collections.<String, String>emptyMap()));
assertEquals("", ExtStringUtils.replaceMultiple("", null));
assertEquals("", ExtStringUtils.replaceMultiple("", Collections.<String, String>emptyMap()));


assertEquals("folks, we are not sane anymore. with me, i promise you, we will burn in flames", ExtStringUtils.replaceMultiple("folks, we are not winning anymore. with me, i promise you, we will win big league", makeMap("win big league", "burn in flames", "winning", "sane")));


assertEquals("bcaacbbcaacb", ExtStringUtils.replaceMultiple("abccbaabccba", makeMap("a", "b", "b", "c", "c", "a")));
assertEquals("bcaCBAbcCCBb", ExtStringUtils.replaceMultiple("abcCBAabCCBa", makeMap("a", "b", "b", "c", "c", "a")));
assertEquals("bcaacbbcaacb", ExtStringUtils.replaceMultiple("abcCBAabCCBa", makeMap("a", "b", "b", "c", "c", "a"), false));


assertEquals("c colon  backslash temp backslash  star  dot  star ", ExtStringUtils.replaceMultiple("c:\\temp\\*.*", makeMap(".", " dot ", ":", " colon ", "\\", " backslash ", "*", " star "), false));
}


private Map<String, String> makeMap(String ... vals) {
Map<String, String> map = new HashMap<String, String>(vals.length / 2);
for(int i = 1; i < vals.length; i+= 2)
map.put(vals[i-1], vals[i]);
return map;
}

算法

替换匹配字符串(不使用正则表达式)的最有效方法之一是使用具有高性能 试试(读作“ try”)、快速 hashing算法和高效 collections实现的 Aho-Corasick 算法

简单密码

一个简单的解决方案利用了 Apache 的 StringUtils.replaceEach,如下所示:

  private String testStringUtils(
final String text, final Map<String, String> definitions ) {
final String[] keys = keys( definitions );
final String[] values = values( definitions );


return StringUtils.replaceEach( text, keys, values );
}

这会减慢大型文本的速度。

快速代码

Bor's implementation of the Aho-Corasick algorithm introduces a bit more complexity that becomes an implementation detail by using a façade with the same method signature:

  private String testBorAhoCorasick(
final String text, final Map<String, String> definitions ) {
// Create a buffer sufficiently large that re-allocations are minimized.
final StringBuilder sb = new StringBuilder( text.length() << 1 );


final TrieBuilder builder = Trie.builder();
builder.onlyWholeWords();
builder.removeOverlaps();


final String[] keys = keys( definitions );


for( final String key : keys ) {
builder.addKeyword( key );
}


final Trie trie = builder.build();
final Collection<Emit> emits = trie.parseText( text );


int prevIndex = 0;


for( final Emit emit : emits ) {
final int matchIndex = emit.getStart();


sb.append( text.substring( prevIndex, matchIndex ) );
sb.append( definitions.get( emit.getKeyword() ) );
prevIndex = emit.getEnd() + 1;
}


// Add the remainder of the string (contains no more matches).
sb.append( text.substring( prevIndex ) );


return sb.toString();
}

基准

For the benchmarks, the buffer was created using randomNumeric as follows:

  private final static int TEXT_SIZE = 1000;
private final static int MATCHES_DIVISOR = 10;


private final static StringBuilder SOURCE
= new StringBuilder( randomNumeric( TEXT_SIZE ) );

其中 MATCHES_DIVISOR规定了要注入的变量的数量:

  private void injectVariables( final Map<String, String> definitions ) {
for( int i = (SOURCE.length() / MATCHES_DIVISOR) + 1; i > 0; i-- ) {
final int r = current().nextInt( 1, SOURCE.length() );
SOURCE.insert( r, randomKey( definitions ) );
}
}

基准代码本身(JMH似乎有些过分) :

long duration = System.nanoTime();
final String result = testBorAhoCorasick( text, definitions );
duration = System.nanoTime() - duration;
System.out.println( elapsed( duration ) );

1,000,000:1,000

一个简单的微基准测试,其中包含1,000,000个字符和1,000个随机放置的字符串。

  • TestStringUtils: 25秒,25533毫秒
  • TestBorAhoCorasick: 0秒,68毫秒

毫无争议。

10,000:1,000

使用10,000个字符和1,000个匹配字符串替换:

  • TestStringUtils: 1秒,1402毫秒
  • TestBorAhoCorasick: 0秒,37毫秒

分水岭关闭了。

1000:10

Using 1,000 characters and 10 matching strings to replace:

  • TestStringUtils: 0秒,7毫秒
  • TestBorAhoCorasick: 0秒,19毫秒

对于短字符串,设置 Aho-Corasick 的开销超过了 StringUtils.replaceEach的蛮力方法。

一种基于文本长度的混合方法是可能的,以获得两种实现的最佳效果。

实施方案

Consider comparing other implementations for text longer than 1 MB, including:

证件

与算法有关的文件和信息:

这对我很有效:

String result = input.replaceAll("string1|string2|string3","replacementString");

例如:

String input = "applemangobananaarefruits";
String result = input.replaceAll("mango|are|ts","-");
System.out.println(result);

产量: 苹果-香蕉-水果-

简介: 单类实现 Dave 的答案,自动选择两种算法中效率最高的一种。

这是一个完整的、基于上述优秀 Dave Jarvis 的回答的单类实现。为了最大限度地提高效率,该类自动在两种不同的提供的算法之间进行选择。 (这个答案是为那些只想快速复制粘贴的人准备的。)

ReplaceStrings 类:

package somepackage


import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import org.ahocorasick.trie.Emit;
import org.ahocorasick.trie.Trie;
import org.ahocorasick.trie.Trie.TrieBuilder;
import org.apache.commons.lang3.StringUtils;


/**
* ReplaceStrings, This class is used to replace multiple strings in a section of text, with high
* time efficiency. The chosen algorithms were adapted from: https://stackoverflow.com/a/40836618
*/
public final class ReplaceStrings {


/**
* replace, This replaces multiple strings in a section of text, according to the supplied
* search and replace definitions. For maximum efficiency, this will automatically choose
* between two possible replacement algorithms.
*
* Performance note: If it is known in advance that the source text is long, then this method
* signature has a very small additional performance advantage over the other method signature.
* (Although either method signature will still choose the best algorithm.)
*/
public static String replace(
final String sourceText, final Map<String, String> searchReplaceDefinitions) {
final boolean useLongAlgorithm
= (sourceText.length() > 1000 || searchReplaceDefinitions.size() > 25);
if (useLongAlgorithm) {
// No parameter adaptations are needed for the long algorithm.
return replaceUsing_AhoCorasickAlgorithm(sourceText, searchReplaceDefinitions);
} else {
// Create search and replace arrays, which are needed by the short algorithm.
final ArrayList<String> searchList = new ArrayList<>();
final ArrayList<String> replaceList = new ArrayList<>();
final Set<Map.Entry<String, String>> allEntries = searchReplaceDefinitions.entrySet();
for (Map.Entry<String, String> entry : allEntries) {
searchList.add(entry.getKey());
replaceList.add(entry.getValue());
}
return replaceUsing_StringUtilsAlgorithm(sourceText, searchList, replaceList);
}
}


/**
* replace, This replaces multiple strings in a section of text, according to the supplied
* search strings and replacement strings. For maximum efficiency, this will automatically
* choose between two possible replacement algorithms.
*
* Performance note: If it is known in advance that the source text is short, then this method
* signature has a very small additional performance advantage over the other method signature.
* (Although either method signature will still choose the best algorithm.)
*/
public static String replace(final String sourceText,
final ArrayList<String> searchList, final ArrayList<String> replacementList) {
if (searchList.size() != replacementList.size()) {
throw new RuntimeException("ReplaceStrings.replace(), "
+ "The search list and the replacement list must be the same size.");
}
final boolean useLongAlgorithm = (sourceText.length() > 1000 || searchList.size() > 25);
if (useLongAlgorithm) {
// Create a definitions map, which is needed by the long algorithm.
HashMap<String, String> definitions = new HashMap<>();
final int searchListLength = searchList.size();
for (int index = 0; index < searchListLength; ++index) {
definitions.put(searchList.get(index), replacementList.get(index));
}
return replaceUsing_AhoCorasickAlgorithm(sourceText, definitions);
} else {
// No parameter adaptations are needed for the short algorithm.
return replaceUsing_StringUtilsAlgorithm(sourceText, searchList, replacementList);
}
}


/**
* replaceUsing_StringUtilsAlgorithm, This is a string replacement algorithm that is most
* efficient for sourceText under 1000 characters, and less than 25 search strings.
*/
private static String replaceUsing_StringUtilsAlgorithm(final String sourceText,
final ArrayList<String> searchList, final ArrayList<String> replacementList) {
final String[] searchArray = searchList.toArray(new String[]{});
final String[] replacementArray = replacementList.toArray(new String[]{});
return StringUtils.replaceEach(sourceText, searchArray, replacementArray);
}


/**
* replaceUsing_AhoCorasickAlgorithm, This is a string replacement algorithm that is most
* efficient for sourceText over 1000 characters, or large lists of search strings.
*/
private static String replaceUsing_AhoCorasickAlgorithm(final String sourceText,
final Map<String, String> searchReplaceDefinitions) {
// Create a buffer sufficiently large that re-allocations are minimized.
final StringBuilder sb = new StringBuilder(sourceText.length() << 1);
final TrieBuilder builder = Trie.builder();
builder.onlyWholeWords();
builder.ignoreOverlaps();
for (final String key : searchReplaceDefinitions.keySet()) {
builder.addKeyword(key);
}
final Trie trie = builder.build();
final Collection<Emit> emits = trie.parseText(sourceText);
int prevIndex = 0;
for (final Emit emit : emits) {
final int matchIndex = emit.getStart();


sb.append(sourceText.substring(prevIndex, matchIndex));
sb.append(searchReplaceDefinitions.get(emit.getKeyword()));
prevIndex = emit.getEnd() + 1;
}
// Add the remainder of the string (contains no more matches).
sb.append(sourceText.substring(prevIndex));
return sb.toString();
}


/**
* main, This contains some test and example code.
*/
public static void main(String[] args) {
String shortSource = "The quick brown fox jumped over something. ";
StringBuilder longSourceBuilder = new StringBuilder();
for (int i = 0; i < 50; ++i) {
longSourceBuilder.append(shortSource);
}
String longSource = longSourceBuilder.toString();
HashMap<String, String> searchReplaceMap = new HashMap<>();
ArrayList<String> searchList = new ArrayList<>();
ArrayList<String> replaceList = new ArrayList<>();
searchReplaceMap.put("fox", "grasshopper");
searchReplaceMap.put("something", "the mountain");
searchList.add("fox");
replaceList.add("grasshopper");
searchList.add("something");
replaceList.add("the mountain");
String shortResultUsingArrays = replace(shortSource, searchList, replaceList);
String shortResultUsingMap = replace(shortSource, searchReplaceMap);
String longResultUsingArrays = replace(longSource, searchList, replaceList);
String longResultUsingMap = replace(longSource, searchReplaceMap);
System.out.println(shortResultUsingArrays);
System.out.println("----------------------------------------------");
System.out.println(shortResultUsingMap);
System.out.println("----------------------------------------------");
System.out.println(longResultUsingArrays);
System.out.println("----------------------------------------------");
System.out.println(longResultUsingMap);
System.out.println("----------------------------------------------");
}
}

需要的 Maven 依赖:

(如果需要,将这些内容添加到您的 pom 文件中。)

    <!-- Apache Commons utilities. Super commonly used utilities.
https://mvnrepository.com/artifact/org.apache.commons/commons-lang3 -->
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>3.10</version>
</dependency>


<!-- ahocorasick, An algorithm used for efficient searching and
replacing of multiple strings.
https://mvnrepository.com/artifact/org.ahocorasick/ahocorasick -->
<dependency>
<groupId>org.ahocorasick</groupId>
<artifactId>ahocorasick</artifactId>
<version>0.4.0</version>
</dependency>