views:

44

answers:

1

Hi,

I am working on a project that basically monitors a set remote directories (FTP, networked paths, and another), if the file is considered new and meets criteria we download it and process it. However i am stuck on what the best way is to keep track of the files we already downloaded. I don't want to download any duplicate files, so i need to keep track of what is already downloaded.

Orignally i was storing it as a tree:

server->directory->file_name

When the service shuts down it writes it to a file, and rereads it back when it starts up. However given that when there is around 20,000 or so files in the tree stuff starts to slow down alot.

Is there a better way to do this?

EDIT

The lookup times start to slowdown alot, my basic implementation is a dict of a dict. The storing stuff on the disk is fine, its more or less just the lookup time. I know i can optimize the tree and partition it. However that seems excessive for such a small project i was hoping python would have something like that.

+1  A: 

I would create a set of tuples, then pickle it to a file. The tuples would be (server, directory, file_name), or even just (server, full_file_name_including_directory). There's no need for a multiple-level data structure. The tuples will hash into the set, and give you a O(1) lookup.

You mention "stuff starts to slow down alot," but you don't say if it's reading and writing time, or lookup times that are slowing down. If your lookup times are slowing down, you may be paging. Is your data structure approaching a significant fraction of your physical memory?

One way to get back some memory is to intern() the server names. This way, each server name will be stored only once in memory.

An interesting alternative is to use a Bloom filter. This will let you use far less memory, but will occasionally download a file that you didn't have to. This might be a reasonable trade-off, depending on why you didn't want to download the file twice.

Ned Batchelder
it is slowing down during lookup sorry i corrected the question.
UberJumper
Having `server` in each tuple would not give you the dict-dict's capacity to get a quick overview on all servers and their respective files. Imagine you want to login once to a server and manipulate all its files...
eumiro
@eumiro, I'm not going to imagine any new requirements. The OP said he needs to track duplicates.
Ned Batchelder
@Ned - that's true. Let's see whether your proposal can help him.
eumiro
This works amazingly. Thanks. I never thought to use the power of sets in this way.
UberJumper