views: 218
answers: 3

I have a time limit and would like to know: what is an efficient way to scan a file system remotely (up to 50 million files in the extreme case)? The command dir takes ages (approximately 20 hours!).

+1  A: 

Build a lookup table either locally or on the remote server, update it periodically, and search that. This is how the locate command works on Unix. This is much, much faster (O(1) if you implement the lookup table as a hash) than traversing the file system each time you need to search for a file. The price you pay is that it is only as up-to-date as the last time you indexed the filesystem.
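A minimal sketch of that idea in Python, assuming the index fits in memory; the root "/data" and the file name being looked up are placeholders. The real locate uses an on-disk database built by updatedb, so this only illustrates the trade-off (fast lookups, stale between rebuilds):

    import os

    def build_index(root):
        """Walk the tree once and map each file name to all of its full paths."""
        index = {}
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                index.setdefault(name, []).append(os.path.join(dirpath, name))
        return index

    # Rebuild periodically (e.g. from cron); lookups are then O(1) on average,
    # but results are only as fresh as the last rebuild.
    index = build_index("/data")            # hypothetical root
    print(index.get("report.csv", []))      # every path with that file name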

ire_and_curses
A: 

If you're reading the contents of 50 million+ files then, by definition, you are limited by the slowest of these three things:

  1. Remote I/O (disk)
  2. Network bandwidth
  3. Local processing time (CPU)

If you're doing one file at a time, you can speed it up by parallelizing the algorithm (see the sketch below). Assuming it is optimally parallelized, you will still be limited by one of the above.

(1) can only be addressed by scanning/reading fewer files. (2) can only be addressed by running on the remote host or by reducing the number of files you need to scan. (3) can only be addressed by increasing CPU, distributing the work, and/or running on the remote system.

Reducing the workload can be done by changing the algorithm, changing the requirements, caching results where appropriate, or some combination thereof.
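As a rough illustration of the parallel approach, here is a Python sketch that fans the per-file work out over a thread pool. The mount point and the per-file check are placeholders, and threads only help while the cost per file is dominated by remote I/O latency rather than local CPU:

    import os
    from concurrent.futures import ThreadPoolExecutor

    def check(path):
        """Placeholder per-file work: fetch the size; swap in the real scan here."""
        try:
            return path, os.path.getsize(path)
        except OSError:
            return path, None

    def walk_files(root):
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                yield os.path.join(dirpath, name)

    # 16 workers is an arbitrary starting point; tune until you hit one of the
    # three limits above. Note that Executor.map submits the whole iterable up
    # front, so for truly huge trees you would submit paths in batches instead.
    with ThreadPoolExecutor(max_workers=16) as pool:
        for path, size in pool.map(check, walk_files("/mnt/remote")):  # hypothetical mount
            pass  # consume each result here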

cletus
Beyond network bandwidth, network latency is often the more relevant issue: throughput may be high enough, but many round trips are needed (sometimes several per file, usually at least one per directory), since the client doesn't know what else to query until the first results have trickled in.
Eamon Nerbonne
+1  A: 

Log into the server and dump the file listing, e.g.:

 linux: $ ls > list.txt
 windows: dir /b > list.txt

Compress list.txt remotely with your favourite compressor and download it to the local system.

You can make a script to automate the task.
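A sketch of such a script in Python, wrapping the same steps with ssh/scp; the host name and paths are placeholders, and it assumes key-based login to the server:

    import subprocess

    HOST = "user@remote-host"      # hypothetical host
    REMOTE_LIST = "/tmp/list.txt"  # hypothetical path on the server

    # Dump the recursive listing and compress it in a single remote command...
    subprocess.run(
        ["ssh", HOST, f"ls -R / > {REMOTE_LIST} && gzip -f {REMOTE_LIST}"],
        check=True,
    )
    # ...then download the compressed listing for local searching.
    subprocess.run(["scp", f"{HOST}:{REMOTE_LIST}.gz", "."], check=True)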

Pedro Ladaria
If there are millions of files, then they are probably in subdirectories. It would be useful to add recursive flags to your commands. Also, doesn't the OP mention dir, saying it is too slow?
ire_and_curses
dir is slow because it outputs to the screen; console output is a big bottleneck. Dumping to a file is as fast as the hard drive can go. Those commands are just examples: you can use whatever command you want, filter results (grep), use recursion, or create a script which accepts the path as a parameter...
Pedro Ladaria
My filesystem has around 500,000 files. Using "ls -R > list.txt" dumped the whole filesystem into a 6 MB file in seconds. (-R is for recursive.)
Pedro Ladaria