Hello Stack-Overflow,
Here is my situation. I have a list of data that looks like this:
[(id_1, description, id_type), (id_2, description, id_type), ... , (id_n, description, id_type)]
The data is loaded from multiple files that all belong to the same grouping. In each grouping the same id can appear multiple times, each occurrence coming from a different file. I don't care about the duplicates, so I thought a nice way to store all of this would be to throw it into a Set type. However, there is a problem: sometimes the descriptions for the same id vary slightly, like this:
IPI00110753
- Tubulin alpha-1A chain
- Tubulin alpha-1 chain
- Alpha-tubulin 1
- Alpha-tubulin isotype M-alpha-1
(Note: this example is taken from the UniProt protein database.)
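To make the problem concrete, here is a minimal sketch of what happens with a plain set of tuples. The "IPI" value for id_type and the variable names are made up for illustration; they are not my real data.

```python
# Hypothetical records; "IPI" as the id_type is an assumption for illustration.
records = [
    ("IPI00110753", "Tubulin alpha-1A chain", "IPI"),
    ("IPI00110753", "Tubulin alpha-1 chain", "IPI"),
    ("IPI00110753", "Tubulin alpha-1A chain", "IPI"),  # exact duplicate of the first
]

# A set compares whole tuples, so only the exact duplicate collapses.
# The two differing descriptions for the same id stay as separate elements.
unique = set(records)
print(len(unique))  # 2
```

So the set removes identical tuples but still keeps one entry per distinct description, which is not what I want.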
Now I don't care if the descriptions vary. Initially it might seem like I could just throw them away (because I could look them up in a database later). However, I can't do this because there is a chance that the protein database I am using will not contain a listing for a certain identifier. If this happens, I want to be able to display the human-readable description to the biologists so they know roughly what protein they are looking at.
I am currently solving this problem by using a dictionary type. However, I don't really like this solution because it uses a lot of memory (I have a lot of these IDs). This is only an intermediary listing of them. There is some additional processing the IDs go through before they are placed in the database, so I would like to keep my data structure smaller.
I have two questions really. First, will I get a smaller memory footprint using the Set type (over the dictionary type) for this? Or should I use a sorted list, checking on every insert whether the ID already exists, or is there a third solution that I haven't thought of? Second, if the Set type is the better answer, how do I key it to look at just the first element of the tuple instead of the whole thing?
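Roughly, the dictionary approach looks like the sketch below. This is a simplified illustration, not my actual code: the function name `collect`, the sample records, and the choice to keep the first description seen are all assumptions.

```python
def collect(records):
    """Keep the first (description, id_type) pair seen for each id."""
    by_id = {}
    for ident, description, id_type in records:
        # setdefault only stores the value if the id is not already a key,
        # so later description variants for the same id are ignored.
        by_id.setdefault(ident, (description, id_type))
    return by_id


# Hypothetical sample data; "IPI" as the id_type is made up.
sample = [
    ("IPI00110753", "Tubulin alpha-1A chain", "IPI"),
    ("IPI00110753", "Alpha-tubulin 1", "IPI"),
]
result = collect(sample)
print(result)
```

This gives me one entry per id, but every entry pays the hash-table overhead of the dictionary, which is the part I would like to reduce.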
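For reference, the sorted-list alternative I have in mind would be something like this sketch using the standard `bisect` module. The function name `insert_unique` and the sample data are hypothetical, and I have not measured whether this actually saves memory.

```python
import bisect


def insert_unique(entries, record):
    """Insert record into the sorted list unless its id is already present.

    entries is a list of (id, description, id_type) tuples sorted by id.
    Returns True if the record was inserted, False if the id already existed.
    """
    ident = record[0]
    # A 1-tuple (ident,) sorts before any longer tuple that starts with ident,
    # so bisect_left lands on the first entry with that id, if there is one.
    pos = bisect.bisect_left(entries, (ident,))
    if pos < len(entries) and entries[pos][0] == ident:
        return False
    # The lookup is a binary search, but list.insert still shifts elements,
    # so each insertion is linear in the worst case.
    entries.insert(pos, record)
    return True


# Hypothetical sample data; the second id and "IPI" id_type are made up.
entries = []
insert_unique(entries, ("IPI00110753", "Tubulin alpha-1A chain", "IPI"))
insert_unique(entries, ("IPI00000001", "Some other protein", "IPI"))
insert_unique(entries, ("IPI00110753", "Alpha-tubulin 1", "IPI"))  # id already present
print([e[0] for e in entries])
```

The list stays sorted by id and holds one tuple per id, at the cost of slower insertion than a hash table.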
Thank you for reading my question,
Tim
Update
Based on some of the comments I received, let me clarify a little. Most of what I do with the data structure is insert into it. I only read it twice: once to annotate it with additional information,* and once to insert it into the database. However, down the line there may be additional annotation done before I insert into the database. Unfortunately, I don't know at this time whether that will happen.
Right now I am looking into storing this data in a structure that is not based on a hash table (i.e. not a dictionary). I would like the new structure to be fairly quick on insertion, but reading it can be linear since I only really do it twice. I am trying to move away from the hash table to save space. Is there a better structure, or is a hash table about as good as it gets?
*The information is a list of Swiss-Prot protein identifiers that I get by querying UniProt.