Efficiently counting files in directory and subfolders with specific name


I can count all the files in a folder and its sub-folders; the folders themselves are not counted.

(gci -Path *Fill_in_path_here* -Recurse -File | where Name -like "*STB*").Count 

However, PowerShell is too slow for the number of files involved (up to 700k). I read that cmd is faster at this kind of task.

Unfortunately I have no knowledge of cmd code at all. In the example above I am counting all the files with STB in the file name.

That is what I would like to do in cmd as well.

Any help is appreciated.

 


Theo's helpful answer based on direct use of .NET ([System.IO.Directory]::EnumerateFiles()) is the fastest option.

Its limitations are:

  • An exception is thrown when an inaccessible directory is encountered (due to lack of permissions). You can catch the exception, but you cannot continue the enumeration; that is, you cannot robustly enumerate all items that you can access while ignoring those that you cannot.

  • Hidden items are invariably included.
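Applied to the question's scenario, the .NET approach might look roughly like the following. This is a minimal sketch, not a reproduction of Theo's answer; the function name Get-MatchingFileCount and its parameters are hypothetical, chosen only for illustration.

```powershell
# Hypothetical helper (not from the answers): counts matching files recursively via .NET.
function Get-MatchingFileCount {
  param(
    [Parameter(Mandatory)] [string] $Path,   # pass a full path: .NET ignores PowerShell's current location
    [string] $Pattern = '*STB*'
  )
  $count = 0
  # EnumerateFiles() streams the matching paths one by one;
  # 'AllDirectories' makes the enumeration recursive.
  foreach ($file in [System.IO.Directory]::EnumerateFiles($Path, $Pattern, 'AllDirectories')) {
    $count++
  }
  $count
}
```

Calling it as Get-MatchingFileCount -Path $somePath inside a try / catch statement lets you report an inaccessible directory, but, as noted above, the enumeration cannot be resumed at that point.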


A native PowerShell solution that addresses these limitations and is still reasonably fast is to use Get-ChildItem with the -Filter parameter.

(Get-ChildItem -LiteralPath $somePath -Filter *STB* -Recurse -File).Count 
  • Hidden items are excluded by default; add -Force to include them.

  • To ignore permission problems, add -ErrorAction SilentlyContinue or -ErrorAction Ignore; the advantage of SilentlyContinue is that you can later inspect the $Error collection to determine the specific errors that occurred, so as to ensure that the errors truly only stem from permission problems.
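As an alternative to inspecting $Error afterwards, the common -ErrorVariable parameter can collect just this command's errors. The sketch below assumes that permission problems surface as System.UnauthorizedAccessException, which is worth verifying on your system; $somePath is a placeholder, and the fallback to the current location is for illustration only.

```powershell
# Assumption: $somePath holds the root directory to search; the fallback is only for illustration.
if (-not $somePath) { $somePath = (Get-Location).Path }

# Silence non-terminating errors, but collect them in $enumErrors for later inspection.
$files = @(Get-ChildItem -LiteralPath $somePath -Filter *STB* -Recurse -File -ErrorAction SilentlyContinue -ErrorVariable enumErrors)
$files.Count

# Flag any collected error that does not look like a permission problem.
$unexpected = @($enumErrors | Where-Object { $_.Exception -isnot [System.UnauthorizedAccessException] })
if ($unexpected.Count -gt 0) {
  Write-Warning "$($unexpected.Count) error(s) were not permission-related; inspect `$enumErrors."
}
```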

Note that while you can also pass wildcard arguments to -Path (the implied first positional parameter) and -Include (as in TobyU's answer), it is only -Filter that provides significant speed improvements, due to filtering at the source (the filesystem driver), so that PowerShell only receives the already-filtered results; by contrast, -Path / -Include must first enumerate everything and match against the wildcard pattern afterwards.[1]

Caveats re -Filter use:

  • Its wildcard language is not the same as PowerShell's; notably, it doesn't support character sets/ranges (e.g. *[0-9]) and it has legacy quirks - see this answer.
  • It only supports a single wildcard pattern, whereas -Include supports multiple (as an array).

That said, -Filter processes wildcards the same way as cmd.exe's dir.
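If you need a pattern that -Filter cannot express, one possible workaround is to let -Filter do the coarse matching at the source and refine the result with PowerShell's own wildcard language. A minimal sketch, in which the pattern *STB[0-9]* is a made-up example and $somePath is a placeholder:

```powershell
# Assumption: $somePath holds the root directory to search; the fallback is only for illustration.
if (-not $somePath) { $somePath = (Get-Location).Path }

# -Filter narrows the candidates at the source (no character ranges available there);
# Where-Object then applies PowerShell's richer wildcards to the much smaller remainder.
$files = @(Get-ChildItem -LiteralPath $somePath -Filter *STB* -Recurse -File |
  Where-Object Name -like '*STB[0-9]*')
$files.Count
```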


Finally, for the sake of completeness, you can adapt MC ND's helpful answer based on cmd.exe's dir command for use in PowerShell, which simplifies matters:

(cmd /c dir /s /b /a-d "$somePath/*STB*").Count 

PowerShell captures an external program's stdout output as an array of lines, whose element count you can simply query with the .Count (or .Length) property.
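One pitfall with .Length: when only a single line is output, PowerShell captures it as a plain string rather than a one-element array, so .Length then reports the string's character count. The sketch below simulates that case with a made-up path; wrapping the command in @(...) is one way to guard against it.

```powershell
# Simulated captured output: a single line comes back as a plain string, not an array.
$oneLine = 'C:\data\fileSTB.txt'   # made-up path, for illustration only

$oneLine.Length      # the string's character count, not a line count
@($oneLine).Count    # 1, because @(...) guarantees an array

# Applied to the dir-based command ($somePath is a placeholder):
# @(cmd /c dir /s /b /a-d "$somePath/*STB*").Count
```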

That said, this may or may not be faster than PowerShell's own Get-ChildItem -Filter, depending on the filtering scenario; also note that dir /s can only ever return path strings, whereas Get-ChildItem returns rich objects whose properties you can query.

Caveats re dir use:

  • /a-d excludes directories, i.e., only reports files, but then also includes hidden files, which dir doesn't do by default.

  • dir /s invariably descends into hidden directories too during the recursive enumeration; an /a (attribute-based) filter is only applied to the leaf items of the enumeration (only to files in this case).

  • dir /s quietly ignores inaccessible directories.


Performance comparison:

  • The following tests compare pure enumeration performance without filtering, for simplicity, using a sizable directory tree assumed to be present on all systems, c:/windows/winsxs; that said, it's easy to adapt the tests to also compare filtering performance.

  • The tests are run from PowerShell, which means that some overhead is introduced by creating a child process for cmd.exe in order to invoke dir /s, though (a) that overhead should be relatively low and (b) the larger point is that staying in the realm of PowerShell is well worthwhile, given its vastly superior capabilities compared to cmd.exe.

  • The tests use the Time-Command function, which can be downloaded from this Gist; it averages 10 runs by default.

# Warm up the filesystem cache for the target dir.,
# both from PowerShell and cmd.exe, to be safe.
gci 'c:/windows/winsxs' -rec >$null; cmd /c dir /s 'c:/windows/winsxs' >$null

Time-Command `
  { @([System.IO.Directory]::EnumerateFiles('c:/windows/winsxs', '*', 'AllDirectories')).Count },
  { (Get-ChildItem -Force -Recurse -File 'c:/windows/winsxs').Count },
  { (cmd /c dir /s /b /a-d 'c:/windows/winsxs').Count },
  { cmd /c 'dir /s /b /a-d c:/windows/winsxs | find /c /v """"' }

On my single-core VM with Windows PowerShell v5.1.17134.407 on Microsoft Windows 10 Pro (64-bit; Version 1803, OS Build: 17134.523) I get the following timings, from fastest to slowest (the Factor column shows performance relative to the fastest command):

Command                                                                                    Secs (10-run avg.) TimeSpan         Factor
-------                                                                                    ------------------ --------         ------
@([System.IO.Directory]::EnumerateFiles('c:/windows/winsxs', '*', 'AllDirectories')).Count 11.016             00:00:11.0158660 1.00
(cmd /c dir /s /b /a-d 'c:/windows/winsxs').Count                                          15.128             00:00:15.1277635 1.37
cmd /c 'dir /s /b /a-d c:/windows/winsxs | find /c /v """"'                                16.334             00:00:16.3343607 1.48
(Get-ChildItem -Force -Recurse -File 'c:/windows/winsxs').Count                            24.525             00:00:24.5254979 2.23
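If you would rather not download Time-Command, a rough approximation can be built on the built-in Measure-Command cmdlet. The following is a simplified sketch, not the Gist's implementation: the function name Measure-AverageSeconds and its parameters are hypothetical, and it only returns the average wall-clock seconds without the table output or the Factor column.

```powershell
# Hypothetical stand-in for Time-Command: averages the wall-clock time of several runs.
function Measure-AverageSeconds {
  param(
    [Parameter(Mandatory)] [scriptblock] $ScriptBlock,
    [int] $Runs = 10   # mirrors the 10-run default mentioned above
  )
  $total = 0.0
  foreach ($i in 1..$Runs) {
    # Measure-Command discards the script block's output and returns a TimeSpan.
    $total += (Measure-Command $ScriptBlock).TotalSeconds
  }
  $total / $Runs
}
```

For instance, Measure-AverageSeconds { (Get-ChildItem -Force -Recurse -File 'c:/windows/winsxs').Count } would time one of the commands compared above.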

[1] In fact, due to an inefficient implementation as of Windows PowerShell v5.1 / PowerShell Core 6.2.0-preview.3, use of -Path and -Include is actually slower than using Get-ChildItem unfiltered and instead using an additional pipeline segment with ... | Where-Object Name -like *STB*, as in the OP - see this GitHub issue.
