views:

89

answers:

3

I have two tables.

In one table there are two columns, one has the ID and the other the abstracts of a document about 300-500 words long. There are about 500 rows.

The other table has only one column and >18000 rows. Each cell of that column contains a distinct acronym such as NGF, EPO, TPO etc.

I am interested in a script that will scan each abstract of the table 1 and identify one or more of the acronyms present in it, which are also present in table 2.

Finally the program will create a separate table where the first column contains the content of the first column of the table 1 (i.e. ID) and the acronyms found in the document associated with that ID.

Can some one with expertise in Python, Perl or any other scripting language help?

+1  A: 

It seems to me that you are trying to join the two tables where the acronym appears in the abstract. ie (pseudo SQL):

SELECT acronym.id, document.id
FROM acronym, document
WHERE acronym.value IN explode(documents.abstract)

Given the desired semantics you can use the most straight forward approach:

acronyms = ['ABC', ...]
documents = [(0, "Document zeros discusses the value of ABC in the context of..."), ...]

joins = []

for id, abstract in documents:
    for word in abstract.split():
        try:
            index = acronyms.index(word)
            joins.append((id, index))
        except ValueError:
            pass # word not an acronym

This is a straightforward implementation; however, it has n cubed running time as acronyms.index performs a linear search (of our largest array, no less). We can improve the algorithm by first building a hash index of the acronyms:

acronyms = ['ABC', ...]
documents = [(0, "Document zeros discusses the value of ABC in the context of..."), ...]

index = dict((acronym, idx) for idx, acronym in enumberate(acronyms))    
joins = []

for id, abstract in documents:
    for word in abstract.split():
        try
            joins.append((id, index[word]))
        except KeyError:
            pass # word not an acronym

Of course, you might want to consider using an actual database. That way you won't have to implement your joins by hand.

Aaron Maenpaa
A: 

Thanks a lot for the quick response. I assume the pseudo SQL solution is for MYSQL etc. However it did not work in Microsoft ACCESS.

the second and the third are for Python I assume. Can I feed acronym and document as input files? babru

A: 

It didn't work in Access because tables are accessed differently (e.g. acronym.[id])

and, you know, it's pseudo code...
Aaron Maenpaa