Word frequency count in Java 8

How do I count the frequency of the words in a List in Java 8?

List<String> wordsList = Lists.newArrayList("hello", "bye", "ciao", "bye", "ciao");

The result must be:

{ciao=2, hello=1, bye=2}

I want to share the solution I found, because at first I expected to use a map-and-reduce approach, but it turned out to be a bit different.

Map<String, Long> collect = wordsList.stream()
    .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

Or, for integer values:

Map<String, Integer> collect = wordsList.stream()
    .collect(Collectors.groupingBy(Function.identity(), Collectors.summingInt(e -> 1)));
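For reference, here is a self-contained version of the snippet above (the class name and the use of `Arrays.asList` instead of Guava's `Lists.newArrayList` are mine):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class WordFrequency {
    // count the occurrences of each word, keyed by the word itself
    static Map<String, Long> countWords(List<String> wordsList) {
        return wordsList.stream()
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> wordsList = Arrays.asList("hello", "bye", "ciao", "bye", "ciao");
        // prints the three counts; iteration order of the HashMap may vary
        System.out.println(countWords(wordsList));
    }
}
```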

Edit

I am adding how to sort the map by its values:

LinkedHashMap<String, Long> countByWordSorted = collect.entrySet()
    .stream()
    .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
    .collect(Collectors.toMap(
        Map.Entry::getKey,
        Map.Entry::getValue,
        (v1, v2) -> {
            throw new IllegalStateException();
        },
        LinkedHashMap::new
    ));

(Note: see the edits below)

As an alternative to Mounas' answer, here is an approach that counts the words in parallel:

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ParallelWordCount
{
    public static void main(String[] args)
    {
        List<String> list = Arrays.asList(
            "hello", "bye", "ciao", "bye", "ciao");
        Map<String, Integer> counts = list.parallelStream().
            collect(Collectors.toConcurrentMap(
                w -> w, w -> 1, Integer::sum));
        System.out.println(counts);
    }
}

Edit: In response to the comments, I ran a small test with JMH, comparing the toConcurrentMap and the groupingByConcurrent approaches with different input-list sizes and random words of different lengths. This test suggested that the toConcurrentMap approach is faster. When one considers how different these approaches are "under the hood", it is hard to predict something like this.

As a further extension, based on further comments, I extended the test to cover all four combinations of toMap, groupingBy, serial and parallel.

The results still show that the toMap approach is faster, but, unexpectedly (at least for me), the "concurrent" versions in both cases are slower than the serial versions...:

              (method)  (count)  (wordLength)  Mode  Cnt      Score     Error  Units
       toConcurrentMap     1000             2  avgt   50    146,636 ±   0,880  us/op
       toConcurrentMap     1000             5  avgt   50    272,762 ±   1,232  us/op
       toConcurrentMap     1000            10  avgt   50    271,121 ±   1,125  us/op
                 toMap     1000             2  avgt   50     44,396 ±   0,541  us/op
                 toMap     1000             5  avgt   50     46,938 ±   0,872  us/op
                 toMap     1000            10  avgt   50     46,180 ±   0,557  us/op
            groupingBy     1000             2  avgt   50     46,797 ±   1,181  us/op
            groupingBy     1000             5  avgt   50     68,992 ±   1,537  us/op
            groupingBy     1000            10  avgt   50     68,636 ±   1,349  us/op
  groupingByConcurrent     1000             2  avgt   50    231,458 ±   0,658  us/op
  groupingByConcurrent     1000             5  avgt   50    438,975 ±   1,591  us/op
  groupingByConcurrent     1000            10  avgt   50    437,765 ±   1,139  us/op
       toConcurrentMap    10000             2  avgt   50    712,113 ±   6,340  us/op
       toConcurrentMap    10000             5  avgt   50   1809,356 ±   9,344  us/op
       toConcurrentMap    10000            10  avgt   50   1813,814 ±  16,190  us/op
                 toMap    10000             2  avgt   50    341,004 ±  16,074  us/op
                 toMap    10000             5  avgt   50    535,122 ±  24,674  us/op
                 toMap    10000            10  avgt   50    511,186 ±   3,444  us/op
            groupingBy    10000             2  avgt   50    340,984 ±   6,235  us/op
            groupingBy    10000             5  avgt   50    708,553 ±   6,369  us/op
            groupingBy    10000            10  avgt   50    712,858 ±  10,248  us/op
  groupingByConcurrent    10000             2  avgt   50    901,842 ±   8,685  us/op
  groupingByConcurrent    10000             5  avgt   50   3762,478 ±  21,408  us/op
  groupingByConcurrent    10000            10  avgt   50   3795,530 ±  32,096  us/op

I don't have much experience with JMH, and maybe I did something wrong here -- suggestions and corrections are welcome:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.stream.Collectors;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
public class ParallelWordCount
{
    @Param({"toConcurrentMap", "toMap", "groupingBy", "groupingByConcurrent"})
    public String method;

    @Param({"2", "5", "10"})
    public int wordLength;

    @Param({"1000", "10000"})
    public int count;

    private List<String> list;

    @Setup
    public void initList()
    {
        list = createRandomStrings(count, wordLength, new Random(0));
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    public void testMethod(Blackhole bh)
    {
        if (method.equals("toMap"))
        {
            Map<String, Integer> counts =
                list.stream().collect(
                    Collectors.toMap(
                        w -> w, w -> 1, Integer::sum));
            bh.consume(counts);
        }
        else if (method.equals("toConcurrentMap"))
        {
            Map<String, Integer> counts =
                list.parallelStream().collect(
                    Collectors.toConcurrentMap(
                        w -> w, w -> 1, Integer::sum));
            bh.consume(counts);
        }
        else if (method.equals("groupingBy"))
        {
            Map<String, Long> counts =
                list.stream().collect(
                    Collectors.groupingBy(
                        Function.identity(), Collectors.<String>counting()));
            bh.consume(counts);
        }
        else if (method.equals("groupingByConcurrent"))
        {
            Map<String, Long> counts =
                list.parallelStream().collect(
                    Collectors.groupingByConcurrent(
                        Function.identity(), Collectors.<String>counting()));
            bh.consume(counts);
        }
    }

    private static String createRandomString(int length, Random random)
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++)
        {
            int c = random.nextInt(26);
            sb.append((char) (c + 'a'));
        }
        return sb.toString();
    }

    private static List<String> createRandomStrings(
        int count, int length, Random random)
    {
        List<String> list = new ArrayList<String>(count);
        for (int i = 0; i < count; i++)
        {
            list.add(createRandomString(length, random));
        }
        return list;
    }
}

These timings are similar only for the serial case, with a list containing 10000 elements and two-letter words.

It could be worth checking whether, for even larger list sizes, the concurrent versions eventually outperform the serial ones, but I currently don't have the time to run another detailed benchmark with all these configurations.

If you use Eclipse Collections, you can just convert the List to a Bag.

Bag<String> words =
    Lists.mutable.with("hello", "bye", "ciao", "bye", "ciao").toBag();

Assert.assertEquals(2, words.occurrencesOf("ciao"));
Assert.assertEquals(1, words.occurrencesOf("hello"));
Assert.assertEquals(2, words.occurrencesOf("bye"));

You can also create a Bag directly using the Bags factory class:

Bag<String> words =
    Bags.mutable.with("hello", "bye", "ciao", "bye", "ciao");

This code will work with Java 5+.
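For comparison, the same count without any third-party library, in the pre-stream style that also runs on Java 5+ (a minimal sketch; the class and method names are mine):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PlainLoopCount {
    // classic pre-Java-8 counting loop over a HashMap
    static Map<String, Integer> count(List<String> words) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String word : words) {
            Integer c = counts.get(word);
            counts.put(word, c == null ? 1 : c + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("hello", "bye", "ciao", "bye", "ciao")));
    }
}
```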

Note: I am a committer for Eclipse Collections.

I will present here the solution I made (the one with grouping is much better :)).

// returns the frequency map instead of discarding it
static private Map<String, Integer> test0(List<String> input) {
    Set<String> set = input.stream()
        .collect(Collectors.toSet());
    return set.stream()
        .collect(Collectors.toMap(Function.identity(),
            str -> Collections.frequency(input, str)));
}

Just my $0.02.

My other 2 cents, given an array:

import static java.util.stream.Collectors.*;

String[] str = {"hello", "bye", "ciao", "bye", "ciao"};
Map<String, Integer> collected
    = Arrays.stream(str)
        .collect(groupingBy(Function.identity(),
            collectingAndThen(counting(), Long::intValue)));

Here is a way to create a frequency map using the Map functions merge and compute.

List<String> words = Stream.of("hello", "bye", "ciao", "bye", "ciao").collect(toList());
Map<String, Integer> frequencyMap = new HashMap<>();

words.forEach(word ->
    frequencyMap.merge(word, 1, (v, newV) -> v + newV)
);

System.out.println(frequencyMap); // {ciao=2, hello=1, bye=2}

Or

words.forEach(word ->
    frequencyMap.compute(word, (k, v) -> v != null ? v + 1 : 1)
);

Find the most frequent item in a collection, using generics:

// note: Function.identity(), not Functions.identity()
private <V> V findMostFrequentItem(final Collection<V> items)
{
    return items.stream()
        .filter(Objects::nonNull)
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
        .entrySet()
        .stream()
        .max(Comparator.comparing(Entry::getValue))
        .map(Entry::getKey)
        .orElse(null);
}

Count item frequencies:

private <V> Map<V, Long> findFrequencies(final Collection<V> items)
{
    return items.stream()
        .filter(Objects::nonNull)
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
}

import java.util.ArrayList;
import java.util.Map;
import java.util.stream.Collectors;

public class Main {

    public static void main(String[] args) {

        String testString = "qqwweerrttyyaaaaaasdfasafsdfadsfadsewfywqtedywqtdfewyfdweytfdywfdyrewfdyewrefdyewdyfwhxvsahxvfwytfx";

        // count the occurrences of 'a' with a stream over the code points
        long java8Case2 = testString.codePoints().filter(ch -> ch == 'a').count();
        System.out.println(java8Case2);

        // count every character by collecting into a concurrent map
        ArrayList<Character> list = new ArrayList<Character>();
        for (char c : testString.toCharArray()) {
            list.add(c);
        }
        Map<Object, Integer> counts = list.parallelStream().
            collect(Collectors.toConcurrentMap(
                w -> w, w -> 1, Integer::sum));
        System.out.println(counts);
    }
}

You can use Java 8 streams:

Arrays.asList(s).stream()
    .collect(Collectors.groupingBy(Function.<String>identity(),
        Collectors.<String>counting()));

I think there is also a more readable way:

var words = List.of("my", "more", "more", "more", "simple", "way");
var count = words.stream()
    .map(x -> Map.entry(x, 1))
    .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue, Integer::sum));

Similar to the map-reduce approach, each word w is first mapped to a pair (w, 1). Then (the reduce part) the counts (Map.Entry::getValue) of all pairs with the same key (the word w, via Map.Entry::getKey) are aggregated by summing them (Integer::sum).

The final terminal operation returns a HashMap<String, Integer>:

{more=3, simple=1, my=1, way=1}
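To make the map-reduce analogy concrete, the same aggregation can also be written with an explicit Stream.reduce (a sketch only; copying the map per element makes it far less efficient than the toMap version above, and the class name is mine):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReduceWordCount {
    static Map<String, Integer> count(List<String> words) {
        return words.stream().reduce(
            new HashMap<String, Integer>(),
            // accumulator: fold one word into a copy of the running map
            (acc, word) -> {
                Map<String, Integer> m = new HashMap<>(acc);
                m.merge(word, 1, Integer::sum);
                return m;
            },
            // combiner: merge two partial maps (used by parallel streams)
            (m1, m2) -> {
                Map<String, Integer> m = new HashMap<>(m1);
                m2.forEach((k, v) -> m.merge(k, v, Integer::sum));
                return m;
            });
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("my", "more", "more", "more", "simple", "way")));
    }
}
```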