ansaurus

Question

searching for and matching elements across arrays

Answer 1

+1 A:

It seems to me that you are trying to join the two tables where the acronym appears in the abstract, i.e. (pseudo-SQL):

SELECT acronym.id, document.id
FROM acronym, document
WHERE acronym.value IN explode(document.abstract)

Given the desired semantics, you can use the most straightforward approach:

acronyms = ['ABC', ...]
documents = [(0, "Document zero discusses the value of ABC in the context of..."), ...]

joins = []

for id, abstract in documents:
    for word in abstract.split():
        try:
            index = acronyms.index(word)
            joins.append((id, index))
        except ValueError:
            pass # word not an acronym

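This works, but it is slow: acronyms.index performs a linear search (of our largest array, no less) for every word of every abstract, so the running time grows with documents × words per abstract × acronyms. We can improve the algorithm by first building a hash index of the acronyms, which turns each lookup into a constant-time operation on average: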

acronyms = ['ABC', ...]
documents = [(0, "Document zero discusses the value of ABC in the context of..."), ...]

index = dict((acronym, idx) for idx, acronym in enumerate(acronyms))
joins = []

for id, abstract in documents:
    for word in abstract.split():
        try:
            joins.append((id, index[word]))
        except KeyError:
            pass # word not an acronym
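
If the acronyms and abstracts live in text files, the indexed version can be wrapped in a function and fed from them. This is only a sketch under assumptions: the function names and the file layout (one acronym per line, one document per line as id, a tab, then the abstract) are made up for illustration and are not something the question specifies.

```python
def load_acronyms(lines):
    """Read acronyms, one per line (assumed format); blank lines are skipped."""
    return [line.strip() for line in lines if line.strip()]


def load_documents(lines):
    """Read documents, one per line as '<id><TAB><abstract>' (assumed format)."""
    documents = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        doc_id, abstract = line.split("\t", 1)
        documents.append((int(doc_id), abstract))
    return documents


def join_acronyms(acronyms, documents):
    """Return (document_id, acronym_index) pairs via a dict lookup per word."""
    index = {acronym: idx for idx, acronym in enumerate(acronyms)}
    joins = []
    for doc_id, abstract in documents:
        for word in abstract.split():
            if word in index:
                joins.append((doc_id, index[word]))
    return joins


# Usage with hypothetical file names:
#     with open("acronyms.txt") as a, open("documents.txt") as d:
#         joins = join_acronyms(load_acronyms(a), load_documents(d))
```

Both loaders accept any iterable of lines, so an open file or a plain list of strings works. As in the loops above, words are split on whitespace only, so an acronym followed directly by punctuation (for example "ABC,") will not match.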

Of course, you might want to consider using an actual database. That way you won't have to implement your joins by hand.
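
For example, here is a rough sketch using the sqlite3 module from Python's standard library. The table and column names (acronym, document_word) and the function name are my own choices for illustration. SQLite has no explode function, so the sketch assumes the abstracts are split into words in Python first and stored one word per row; the database then does the join.

```python
import sqlite3


def join_acronyms_sql(acronyms, documents):
    """Return sorted, de-duplicated (acronym_id, document_id) pairs.

    Uses an in-memory SQLite database; schema names are illustrative only.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE acronym (id INTEGER, value TEXT)")
        conn.execute("CREATE TABLE document_word (document_id INTEGER, word TEXT)")
        conn.executemany(
            "INSERT INTO acronym (id, value) VALUES (?, ?)",
            list(enumerate(acronyms)),
        )
        # "Explode" each abstract into one row per whitespace-separated word.
        conn.executemany(
            "INSERT INTO document_word (document_id, word) VALUES (?, ?)",
            [
                (doc_id, word)
                for doc_id, abstract in documents
                for word in abstract.split()
            ],
        )
        rows = conn.execute(
            "SELECT DISTINCT a.id, w.document_id "
            "FROM acronym AS a "
            "JOIN document_word AS w ON a.value = w.word "
            "ORDER BY a.id, w.document_id"
        ).fetchall()
    finally:
        conn.close()
    return rows
```

Unlike the loops above, this returns each (acronym, document) pair only once, because of the DISTINCT. For a large data set, an index on document_word.word or acronym.value would probably be worth adding.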

Aaron Maenpaa 2009-01-19 17:59:28

Answer 2

A:

Thanks a lot for the quick response. I assume the pseudo-SQL solution is for MySQL etc. However, it did not work in Microsoft Access.

The second and the third are for Python, I assume. Can I feed the acronyms and documents in as input files?

babru 2009-01-19 19:22:47

Answer 3

A:

It didn't work in Access because tables are accessed differently (e.g. acronym.[id])

2009-01-19 19:45:04

And, you know, it's pseudocode...

Aaron Maenpaa 2009-02-25 16:28:04
