I do have a time limit and would like to know: what is an efficient way to scan a file system remotely (we're talking about 50 million files in the extreme case)? The dir command takes ages (approximately 20 hours!).
Build a lookup table either locally or on the remote server, update it periodically, and search that. This is how the locate
command works on Unix. This is much, much faster (O(1) if you implement the lookup table as a hash) than traversing the file system each time you need to search for a file. The price you pay is that it is only as up-to-date as the last time you indexed the filesystem.
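For illustration, here is a minimal sketch of that idea in Python; the index file name (`file_index.pkl`), the root path, and the pickled-dict storage are assumptions for the example, not part of the original answer (a real `locate` database is far more compact):

```python
import os
import pickle
from collections import defaultdict

INDEX_PATH = "file_index.pkl"  # hypothetical location for the saved index


def build_index(root):
    """Walk the tree once and map each file name to the directories containing it."""
    index = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            index[name].append(dirpath)
    with open(INDEX_PATH, "wb") as f:
        pickle.dump(dict(index), f)


def lookup(name):
    """O(1) average-case lookup against the saved hash table."""
    with open(INDEX_PATH, "rb") as f:
        index = pickle.load(f)
    return index.get(name, [])


# build_index(r"\\server\share")  # slow: run periodically, e.g. overnight
# lookup("report.xlsx")           # fast, but only as fresh as the last rebuild
```

Rebuilding can be scheduled (cron, Task Scheduler) so that queries never have to touch the remote tree at all.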
If you're reading the contents of 50 million+ files then, by definition, you are limited by the slowest of these three things:
1. Remote I/O (disk)
2. Network bandwidth
3. Local processing time (CPU)
If you're processing one file at a time, you can speed things up by parallelizing the algorithm (see the sketch below). Assuming it is optimally parallelized, you will still be limited by one of the above.
(1) can only be addressed by scanning/reading fewer files. (2) can only be addressed by running on the remote host or by reducing the number of files you need to scan. (3) can only be addressed by adding CPU, distributing the work, and/or running on the remote system.
Reducing the workload can be done by changing the algorithm, changing the requirements, caching results where appropriate, or some combination thereof.
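As a rough sketch of the parallel approach (the UNC path and the per-file work in `scan_file` are placeholders, not from the original answer), overlapping the per-file round trips with threads helps when remote latency, rather than bandwidth or CPU, is what dominates:

```python
import os
from concurrent.futures import ThreadPoolExecutor

ROOT = r"\\server\share"  # hypothetical path to the remote tree


def scan_file(path):
    """Placeholder per-file work: here just the size; replace with real processing."""
    try:
        return path, os.path.getsize(path)
    except OSError:
        return path, None


def parallel_scan(root, workers=16):
    # Walking is sequential, but the per-file work (the slow remote round trips)
    # is overlapped across threads.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for dirpath, _dirs, files in os.walk(root):
            paths = (os.path.join(dirpath, f) for f in files)
            for path, size in pool.map(scan_file, paths):
                yield path, size


# for path, size in parallel_scan(ROOT):
#     ...
```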
Log into the server and dump the file listing, e.g.:
Linux: `ls -R > list.txt`
Windows: `dir /s /b > list.txt`
Compress list.txt remotely with your favourite compressor and download it to the local system.
You can make a script to automate the task.
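A minimal automation sketch, assuming SSH access to the server; the host name, remote paths, and the `ls -R`/`gzip` choices are placeholders you would adapt:

```python
import subprocess

HOST = "user@fileserver"       # hypothetical remote host
REMOTE_LIST = "/tmp/list.txt"  # hypothetical remote paths
REMOTE_GZ = "/tmp/list.txt.gz"


def fetch_listing(remote_root="/data", local_file="list.txt.gz"):
    # 1. Dump the recursive listing on the remote side, so no per-file
    #    round trips cross the network, then compress it there.
    subprocess.run(
        ["ssh", HOST, f"ls -R {remote_root} > {REMOTE_LIST} && gzip -f {REMOTE_LIST}"],
        check=True,
    )
    # 2. Download only the compressed listing.
    subprocess.run(["scp", f"{HOST}:{REMOTE_GZ}", local_file], check=True)


if __name__ == "__main__":
    fetch_listing()
```

Run it from cron (or Task Scheduler on Windows) so the local copy of the listing stays reasonably fresh.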