I have a sorted list of 1,000,000 protein names (strings of at most 256 characters), each with an associated ID. I also have an unsorted list of 4,000,000,000 words taken from articles (again at most 256 characters each), and every word has an ID.

I want to find all matches between the list of protein names and the list of article words. Which algorithm should I use? Should I use some prebuilt API?

It would be good if the algorithm runs on a normal PC without special hardware.

An estimate of the time the algorithm would need would be nice, but is not obligatory.

A: 

Sounds like something you should use a binary tree for, maybe.

Kingdom of Fish
+1  A: 

4 billion strings is a lot of strings to search.

You may be able to fit the entire data structure into an in-memory hash for fast lookup, but more likely you'd want to store the entire list on more spacious (but slower) disk, in which case a sorted list lends itself to the relatively efficient binary search algorithm.

If your binary search (or similar) function were called find_string_in_articles(), then in pseudocode:

foreach my $protein_name ( @protein_names ) {
    # Look each protein name up in the article word list.
    if ( my $article_id = find_string_in_articles( $protein_name ) ) {
        print( "$protein_name matches $article_id\n" );
    }
}
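
As a rough illustration, find_string_in_articles() could be a binary search over the sorted word list. A minimal Python sketch, assuming the sorted article words and their IDs fit in two parallel in-memory lists (the names words and word_ids are made up for the example):

from bisect import bisect_left

def find_string_in_articles(name, words, word_ids):
    # words is the sorted list of article words; word_ids holds the
    # corresponding IDs at the same positions (illustrative layout only).
    i = bisect_left(words, name)  # O(log n) binary search
    if i < len(words) and words[i] == name:
        return word_ids[i]
    return None

Each lookup is O(log n), but for 4 billion entries the sorted list would have to live on disk, which is what the comment below warns about.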
PP
Most search algorithms on disk storage are horrendous performance-wise. Swap the collections so you can do the lookups in memory on the proteins, and sequentially scan the article words.
Simon Buchan
+1  A: 

You could sort both lists and then do a "mergesort"-style pass which would not actually merge, but would find duplicates/overlaps. Wikipedia has good references on that.
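
A minimal sketch of that merge-style pass in Python, assuming both lists have already been sorted (e.g. by an external sort) and can be read as sorted iterators; the names are illustrative:

def sorted_intersection(proteins, words):
    # Walk both sorted streams in lockstep, like the merge step of
    # mergesort, but report matches instead of merging.
    proteins, words = iter(proteins), iter(words)
    p, w = next(proteins, None), next(words, None)
    while p is not None and w is not None:
        if p < w:
            p = next(proteins, None)
        elif w < p:
            w = next(words, None)
        else:
            yield w                 # a protein name that occurs as an article word
            w = next(words, None)   # keep p so repeated words still match

Each input is read once, sequentially, so after the sorting step the join itself is a single cheap pass.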

Sorting that amount of data probably requires more memory than you have available. I don't know whether Unix sort (available on Windows/Mac too) can handle it, but any decent SQL database can.

Another possibility is to use a radix tree on your protein names (those starting with A go to bin A, B to bin B, etc.). Then just loop over the 4 gazillion words and locate overlaps (you probably need to bin more than one character deep to discard more proteins at a time).
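
A sketch of one level of that binning in Python, assuming the protein names fit in memory; deeper levels would simply bin on more leading characters (all names here are made up):

from collections import defaultdict

def build_bins(protein_names, depth=2):
    # Group protein names by their first `depth` characters.
    bins = defaultdict(set)
    for name in protein_names:
        bins[name[:depth]].add(name)
    return bins

def find_overlaps(bins, article_words, depth=2):
    # One sequential pass over the article words; only the matching
    # bin is ever consulted.
    for word in article_words:
        if word in bins.get(word[:depth], ()):
            yield word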

Pasi Savolainen
A: 

I would go about this in one of two ways.

  1. Insert it into an SQL database and pull out the data you need (slower, but easier); a sketch of this follows below.
  2. Sort the list, then do binary searches to find what you need (fast, but tricky).
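
For option 1, a minimal sketch using Python's built-in sqlite3; the table and column names are made up, the rows are tiny stand-ins for the real lists, and with 4 billion rows you would batch the inserts and expect the load/index step to dominate:

import sqlite3

# Tiny illustrative inputs; in reality these would be streamed from the real lists.
protein_rows = [(1, "BRCA1"), (2, "TP53")]
word_rows = [(10, "the"), (11, "TP53"), (12, "protein")]

conn = sqlite3.connect("matches.db")
conn.execute("CREATE TABLE proteins (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE words (id INTEGER, word TEXT)")
conn.executemany("INSERT INTO proteins VALUES (?, ?)", protein_rows)
conn.executemany("INSERT INTO words VALUES (?, ?)", word_rows)
conn.execute("CREATE INDEX idx_protein_name ON proteins (name)")  # index the smaller side
conn.commit()

query = "SELECT p.name, p.id, w.id FROM words w JOIN proteins p ON p.name = w.word"
for name, protein_id, word_id in conn.execute(query):
    print(name, "- protein", protein_id, "matches word", word_id)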
Rook
+1  A: 

This is essentially a relational join. Assuming the article words aren't already sorted, your basic algorithm should be:

# proteins.find() looks a word up in the in-memory protein collection.
for word in article_words:
    if proteins.find(word):
        found_match(word)

proteins.find() is the difficult part, and you will have to experiment to get the best performance; this sort of problem is where cache effects start to come into play. I would first try a radix sort: it's pretty simple and is likely fast enough, but binary searching and hashing are also alternatives.
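
A minimal sketch of the hashing alternative in Python, assuming the million protein names fit comfortably in memory (at up to 256 bytes each they should) and the article words can be streamed from disk; the file names and one-word-per-line format are assumptions:

def find_matches(protein_file, words_file):
    # Load the (small) protein list into a hash set, then stream the
    # (huge) article word list past it in one sequential pass.
    with open(protein_file, encoding="utf-8") as f:
        proteins = {line.rstrip("\n") for line in f}
    with open(words_file, encoding="utf-8") as f:
        for line in f:
            word = line.rstrip("\n")
            if word in proteins:      # O(1) average lookup
                yield word

The run time is then dominated by reading the 4-billion-word list once from disk, which is about as good as a normal PC can do.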

Simon Buchan