ansaurus

Question

Data persistence for python when a lot of lookups but few writes?

Answer 1

+1 A:

I would create a set of tuples, then pickle it to a file. The tuples would be (server, directory, file_name), or even just (server, full_file_name_including_directory). There's no need for a multiple-level data structure. The tuples will hash into the set, and give you a O(1) lookup.

You mention "stuff starts to slow down alot," but you don't say if it's reading and writing time, or lookup times that are slowing down. If your lookup times are slowing down, you may be paging. Is your data structure approaching a significant fraction of your physical memory?

One way to get back some memory is to intern() the server names. This way, each server name will be stored only once in memory.

An interesting alternative is to use a Bloom filter. This will let you use far less memory, but will occasionally download a file that you didn't have to. This might be a reasonable trade-off, depending on why you didn't want to download the file twice.

Ned Batchelder 2010-10-24 18:32:33

it is slowing down during lookup sorry i corrected the question.

UberJumper 2010-10-24 18:34:40

Having `server` in each tuple would not give you the dict-dict's capacity to get a quick overview on all servers and their respective files. Imagine you want to login once to a server and manipulate all its files...

eumiro 2010-10-24 18:35:52

@eumiro, I'm not going to imagine any new requirements. The OP said he needs to track duplicates.

Ned Batchelder 2010-10-24 18:37:53

@Ned - that's true. Let's see whether your proposal can help him.

eumiro 2010-10-24 18:46:01

This works amazingly. Thanks. I never thought to use the power of sets in this way.

UberJumper 2010-10-25 05:50:30

ansaurus

tags:

views:

answers:

Data persistence for python when a lot of lookups but few writes?

related questions