Fast way to find duplicate in large array

  • A+
Category:Languages

I have an array with 35k elements. How may I efficiently find duplicates and return those duplicates?

all = [*1..35000, 1] 

This solution works:

all.select { |v| all.count(v) > 1 } 

But it takes a long time.

 


Your code is taking an eon to execute because it is executing count for each element, resulting in it having a computational complexity of O(n2).

arr = [*1..35000, 1, 34999] 

If you want to know which values appear in the array at least twice...

require 'set'  uniq_set = Set.new arr.each_with_object(Set.new) { |x,dup_set| uniq_set.add?(x) || dup_set.add(x) }.to_a   #=> [1, 34999] 

Set lookups (implemented with a hash under the covers) are extremely fast.

See Set#add? and Set#add.

If you want to know the numbers of times values appear in the array that appear at least twice...

arr.each_with_object(Hash.new(0)) { |x,h| h[x] += 1 }.select { |_,v| v > 1 }   #=> {1=>2, 34999=>2} 

This uses a counting hash. See Hash::new when it takes a default value as an argument.

If you want to know the indices of values that appear in the array at least twice...

arr.each_with_index.     with_object({}) { |(x,i),h| (h[x] ||= []) << i }.     select { |_,v| v.size > 1 }   #=> {1=>[0, 35000], 34999=>[34998, 35001]} 

When the hash h does not already have a key x,

(h[x] ||= []) << i    #=> (h[x] = h[x] || []) << i    #=> (h[x] = nil || []) << i    #=> (h[x] = []) << i    #=> [] << i where [] is now h[x] 

Comment

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: