How to count duplicate elements in a Ruby array

I have a sorted array:

[
'FATAL <error title="Request timed out.">',
'FATAL <error title="Request timed out.">',
'FATAL <error title="There is insufficient system memory to run this query.">'
]

I would like to get something like this, though it doesn't have to be a hash:

[
{:error => 'FATAL <error title="Request timed out.">', :count => 2},
{:error => 'FATAL <error title="There is insufficient system memory to run this query.">', :count => 1}
]

The following code prints what you asked for. I'll let you decide how to actually use it to generate the hash you are looking for:

# sample array
a=["aa","bb","cc","bb","bb","cc"]


# make the hash default to 0 so that += will work correctly
b = Hash.new(0)


# iterate over the array, counting duplicate entries
a.each do |v|
  b[v] += 1
end


b.each do |k, v|
  puts "#{k} appears #{v} times"
end

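Note: I just noticed that you said the array is already sorted. The code above does not require sorting. Making use of that property may produce faster code.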
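
As a rough sketch of how the sorted property could be used: in a sorted array equal elements sit next to each other, so consecutive runs can be counted directly, without a lookup hash. This assumes Ruby 2.3 or later for Enumerable#chunk_while, and the variable names are only illustrative. Whether it is actually faster would need measuring.

```ruby
# Assumes the array is sorted, so equal elements are adjacent.
sorted = ["aa", "bb", "bb", "bb", "cc", "cc"]

# chunk_while keeps consecutive elements together while the block is true (Ruby >= 2.3).
runs = sorted.chunk_while { |x, y| x == y }

# Turn each run into the { :error, :count } shape from the question.
counts = runs.map { |run| { :error => run.first, :count => run.size } }

counts.each { |c| puts "#{c[:error]} appears #{c[:count]} times" }
```

On an unsorted array this would count each run separately, so the same value could appear more than once in the result.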

a = [1,1,1,2,2,3]
a.uniq.inject([]){|r, i| r << { :error => i, :count => a.select{ |b| b == i }.size } }
=> [{:count=>3, :error=>1}, {:count=>2, :error=>2}, {:count=>1, :error=>3}]

With inject you can do this very concisely (in one line):

a = ['FATAL <error title="Request timed out.">',
'FATAL <error title="Request timed out.">',
'FATAL <error title="There is insufficient ...">']


b = a.inject(Hash.new(0)) {|h,i| h[i] += 1; h }


b.to_a.each {|error,count| puts "#{count}: #{error}" }

This will produce:

1: FATAL <error title="There is insufficient ...">
2: FATAL <error title="Request timed out.">

A simple implementation:

(errors_hash = {}).default = 0
array_of_errors.each { |error| errors_hash[error] += 1 }

Personally, I'd do it this way:

# myprogram.rb
a = ['FATAL <error title="Request timed out.">',
'FATAL <error title="Request timed out.">',
'FATAL <error title="There is insufficient system memory to run this query.">']
puts a

Then run the program and pipe its output to uniq -c:

ruby myprogram.rb | uniq -c

Output:

   2 FATAL <error title="Request timed out.">
   1 FATAL <error title="There is insufficient system memory to run this query.">

Here is the sample array:

a=["aa","bb","cc","bb","bb","cc"]
  1. Select all the unique keys.
  2. For each key, accumulate the matching elements into a hash, to get something like this: {'bb' => ['bb', 'bb']}
res = a.uniq.inject({}) {|accu, uni| accu.merge({ uni => a.select{|i| i == uni } })}
{"aa"=>["aa"], "bb"=>["bb", "bb", "bb"], "cc"=>["cc", "cc"]}

Now you can do things like:

res['aa'].size

If you have an array like this:

words = ["aa","bb","cc","bb","bb","cc"]

and you need to count the duplicate elements, a one-line solution is:

result = words.each_with_object(Hash.new(0)) { |word,counts| counts[word] += 1 }
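
If the array-of-hashes shape from the question is wanted, the counts hash can be reshaped with one more step. This is only a sketch: the report variable is an illustrative name, and the :error and :count keys are taken from the question.

```ruby
words = ["aa", "bb", "cc", "bb", "bb", "cc"]

result = words.each_with_object(Hash.new(0)) { |word, counts| counts[word] += 1 }

# Reshape { element => count } into the array of hashes shown in the question.
report = result.map { |word, count| { :error => word, :count => count } }

report.each { |entry| puts "#{entry[:error]}: #{entry[:count]}" }
```

Since Ruby 1.9 hashes keep insertion order, the entries come out in order of first appearance.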

A different approach from the answers above, using Enumerable#group_by.

[1, 2, 2, 3, 3, 3, 4].group_by(&:itself).map { |k,v| [k, v.count] }.to_h
# {1=>1, 2=>2, 3=>3, 4=>1}

Breaking that down into its separate method calls:

a = [1, 2, 2, 3, 3, 3, 4]
a = a.group_by(&:itself) # {1=>[1], 2=>[2, 2], 3=>[3, 3, 3], 4=>[4]}
a = a.map { |k,v| [k, v.count] } # [[1, 1], [2, 2], [3, 3], [4, 1]]
a = a.to_h # {1=>1, 2=>2, 3=>3, 4=>1}

Enumerable#group_by was added in Ruby 1.8.7.

How about the following:

things = [1, 2, 2, 3, 3, 3, 4]
things.uniq.map{|t| [t,things.count(t)]}.to_h

This feels cleaner and more descriptive of what we're actually trying to do.

I suspect that, for large collections, it will also perform better than the approaches that iterate over every value.

Benchmark results:

a = (1...1000000).map { rand(100)}
                           user     system      total        real
inject                 7.670000   0.010000   7.680000 (  7.985289)
array count            0.040000   0.000000   0.040000 (  0.036650)
each_with_object       0.210000   0.000000   0.210000 (  0.214731)
group_by               0.220000   0.000000   0.220000 (  0.218581)

So it is quite a bit faster.
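
The script behind these numbers is not shown, so the following is only a sketch of how such a comparison might be set up with the standard Benchmark module. The input size is deliberately smaller than the one above, and which variant was timed under each label (the "inject" row in particular) is an assumption.

```ruby
require 'benchmark'

# Hypothetical input, smaller than the one above so the sketch runs quickly.
a = Array.new(100_000) { rand(100) }

# Each candidate is a lambda that returns a { value => count } hash.
# Which exact variants were timed above is an assumption.
candidates = {
  'inject'           => -> { a.inject(Hash.new(0)) { |h, i| h[i] += 1; h } },
  'array count'      => -> { a.uniq.map { |t| [t, a.count(t)] }.to_h },
  'each_with_object' => -> { a.each_with_object(Hash.new(0)) { |i, h| h[i] += 1 } },
  'group_by'         => -> { a.group_by(&:itself).map { |k, v| [k, v.count] }.to_h }
}

results = {}
Benchmark.bm(18) do |bm|
  candidates.each do |label, code|
    bm.report(label) { results[label] = code.call }
  end
end

# Sanity check: every approach should agree on the counts.
first = results.values.first
puts results.values.all? { |r| r == first }
```

Timings depend heavily on the Ruby version and on how many distinct values the array holds, so it is worth rerunning this on your own data before drawing conclusions.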

If you want to use this often, I suggest doing this:

# lib/core_extensions/array/duplicates_counter
module CoreExtensions
  module Array
    module DuplicatesCounter
      def count_duplicates
        self.each_with_object(Hash.new(0)) { |element, counter| counter[element] += 1 }.sort_by{|k,v| -v}.to_h
      end
    end
  end
end

Load it with:

Array.include CoreExtensions::Array::DuplicatesCounter

And then use it from anywhere with just:

the_ar = %w(a a a a a a a  chao chao chao hola hola mundo hola chao cachacho hola)
the_ar.count_duplicates
{
"a" => 7,
"chao" => 4,
"hola" => 4,
"mundo" => 1,
"cachacho" => 1
}

From Ruby >= 2.2 you can use itself (note that Hash#transform_values needs Ruby 2.4 or later): array.group_by(&:itself).transform_values(&:count)

In more detail:

array = [
'FATAL <error title="Request timed out.">',
'FATAL <error title="Request timed out.">',
'FATAL <error title="There is insufficient system memory to run this query.">'
];


array.group_by(&:itself).transform_values(&:count)
=> { "FATAL <error title=\"Request timed out.\">"=>2,
"FATAL <error title=\"There is insufficient system memory to run this query.\">"=>1 }

Using Enumerable#tally:

["a", "b", "c", "b"].tally


#=> { "a" => 1, "b" => 2, "c" => 1 }

Note: only available in Ruby versions >= 2.7.
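
For code that has to run on both older and newer Rubies, one possible sketch is to use tally when it exists and fall back to a counting hash otherwise. The helper name count_occurrences is hypothetical and not part of Ruby.

```ruby
# Hypothetical helper: use tally when the running Ruby provides it,
# otherwise fall back to counting with each_with_object.
def count_occurrences(items)
  if items.respond_to?(:tally)
    items.tally
  else
    items.each_with_object(Hash.new(0)) { |item, counts| counts[item] += 1 }
  end
end

counts = count_occurrences(["a", "b", "c", "b"])
puts counts.inspect
```

One difference to keep in mind: the fallback hash returns 0 for missing keys, while the hash returned by tally returns nil.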

def find_most_occurred_item(arr)
  return 'Array has unique elements already' if arr.uniq == arr
  # count how many times each element occurs
  counts = arr.inject(Hash.new(0)) { |h, v| h[v] += 1; h }
  # the highest count found in the array
  max_count = counts.values.max
  # one line of text per element that reaches the highest count
  counts.select { |_, v| v == max_count }
        .map { |k, v| "#{k} appears #{v} times" }
end


puts find_most_occurred_item([1, 2, 3, 4, 4, 4, 3, 3])

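Since #tally is only available in 2.7 and above, and I'm not there yet, it is easy enough to use the #count method on the array instead. Use #uniq on the array to get one copy of each member, then find the #count of that member in the array: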

counts=Hash.new
arr.uniq.each {|name| counts[name]=arr.count(name) }

For example:

arr = [ 1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 5]
arr.uniq # => [1, 2, 3, 4, 5]
counts=Hash.new; arr.uniq.each {|name| counts[name]=arr.count(name) }

which gives us

counts # => {1=>1, 2=>2, 3=>5, 4=>2, 5=>1}