PowerShell Remove Duplicates Memory Issue


I'm currently using this command in Windows PowerShell to remove duplicates from a simple single-column CSV:

gc combine.csv | sort | get-unique > tags.cs 

Whenever I run it on a 150 MB CSV (roughly 20 million rows, at a guess), Task Manager shows PowerShell eating up all available memory (32 GB) and then spilling into virtual memory. I let the script run for about an hour and it still didn't finish. I find that strange, because Excel usually takes only a few seconds to remove duplicates from my 1M-row CSVs. Any suggestions on how to deal with this?

You could try:

Get-Content combine.csv -ReadCount 1000 |
    ForEach-Object { $_ } |
    Sort-Object -Unique |
    Set-Content tags.cs

(Or, with aliases: gc combine.csv -read 1kb | % { $_ } | sort -uniq | sc tags.cs)

But I think you'll hit the same problems. If you want faster results, and the lines don't need to be sorted, just duplicate-free:

$Lines = [System.Collections.Generic.HashSet[string]]::new()
$Lines.UnionWith([string[]][System.IO.File]::ReadAllLines('c:/path/to/combine.csv'))
[System.IO.File]::WriteAllLines('c:/path/to/tags.cs', $Lines)

That ran on my test file of 20M random numbers in 23 seconds and ~1.5 GB of memory. If the lines do need to be sorted, use SortedSet instead of HashSet; that ran in 5 minutes and under 2 GB of memory - while your code was still running and already past 15 GB.
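For reference, a minimal sketch of that sorted variant - the same placeholder paths as above, with SortedSet swapped in for HashSet:

# SortedSet keeps entries ordered, so the output is sorted as well as duplicate-free.
# The paths are placeholders - point them at your own files.
$Lines = [System.Collections.Generic.SortedSet[string]]::new()
$Lines.UnionWith([string[]][System.IO.File]::ReadAllLines('c:/path/to/combine.csv'))
[System.IO.File]::WriteAllLines('c:/path/to/tags.cs', $Lines)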

Edit: tiberriver256 comments that using [System.IO.File]::ReadLines instead of ReadAllLines lets the lines be streamed before the file has finished being read; it returns an enumerator rather than a final array of all the lines. In the HashSet case this knocks the runtime down a little, from 12.5s to 11.5s - it varies too much to be sure, but it does seem to help.
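A sketch of that streaming variant, again with placeholder paths, might look like:

# ReadLines returns an enumerator, so lines flow into the HashSet as the file
# is read instead of materialising the whole file as one array first.
$Lines = [System.Collections.Generic.HashSet[string]]::new()
$Lines.UnionWith([System.IO.File]::ReadLines('c:/path/to/combine.csv'))
[System.IO.File]::WriteAllLines('c:/path/to/tags.cs', $Lines)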

