tags:

views:

81

answers:

2

Hi, I'm facing regular expressions for the first time and I need to extract some data from this report (a txt file with formatting info):

\n10: Vikelis M, Rapoport AM. Role of antiepileptic drugs as preventive agents for \nmigraine. CNS Drugs. 2010 Jan 1;24(1):21-33. doi:\n10.2165/11310970-000000000-00000. Review. PubMed PMID: 20030417.\n\n\n21: Johannessen Landmark C, Larsson PG, Rytter E, Johannessen SI. Antiepileptic\ndrugs in epilepsy and other disorders--a population-based study of prescriptions.\nEpilepsy Res. 2009 Nov;87(1):31-9. Epub 2009 Aug 13. PubMed PMID: 19679449.\n\n\n

As you can see, all of the txt's records begin with a number like "xx:" and always end with "PubMed PMID: dddddddd.", but using a RegEx like this:

import re

regex = re.compile(r"^\d+: .+ PMID: \d{8}.$")
regex.findall(inputfile)

This gives me a list containing one big string, so I'm misunderstanding something. How can I extract data from these records?

+2  A: 

Use .+? for non-greedy matching instead of .+, which gives you greedy matching. You also want re.DOTALL to make sure your . matches the line-end characters it needs to match, and re.MULTILINE to make sure ^ and $ match the starts and ends of lines, not just of the whole string. The options in question need to be joined with the "bit-OR" | operator and passed as the second argument to re.compile.
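Putting those three ideas together (the sample data below is a shortened, made-up stand-in for the question's input, just to show the flags in action):

```python
import re

# Two fabricated records in the same shape as the question's file
data = ("\n10: Vikelis M, Rapoport AM. Role of antiepileptic drugs. "
        "CNS Drugs. 2010. PubMed PMID: 20030417.\n\n\n"
        "21: Johannessen Landmark C. Antiepileptic drugs.\n"
        "Epilepsy Res. 2009. PubMed PMID: 19679449.\n\n\n")

# Non-greedy .+? plus DOTALL (so . can cross newlines) and MULTILINE
# (so ^ and $ work per line), joined with the bit-OR | operator:
regex = re.compile(r"^\d+: .+? PubMed PMID: \d{8}\.$",
                   re.DOTALL | re.MULTILINE)
matches = regex.findall(data)  # one string per record, not one big blob
```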

Alex Martelli
Applied and tested: `re.findall(r'(\d+): (.+?) PubMed PMID: (\d{8})', data, re.M | re.S)`
tux21b
What is greedy matching? It means the regex engine matches as many characters as it can. It is the default behavior. What happened to the OP is that his regex matched the first occurrence of "\d+: " and then everything up to the last occurrence of "\d{8}.", effectively matching the whole input text.
Philippe A.
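A minimal illustration of the greedy/non-greedy difference Philippe describes (the sample string is made up):

```python
import re

s = "a: 111. b: 222."
greedy = re.findall(r"a: .+\.", s)   # .+ runs on to the LAST period
lazy = re.findall(r"a: .+?\.", s)    # .+? stops at the FIRST period
# greedy -> ['a: 111. b: 222.'], lazy -> ['a: 111.']
```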
Thanks for your help, "non-greedy matching" changes things a lot :) RegEx are a powerful tool indeed!
Gianluca Bargelli
@Gianluca, prego!-)
Alex Martelli
+1  A: 

If the records are as consistent as presented in your example, you don't need to use regular expressions. A simple partition of the text file into lists of tokens will do the trick. For instance:

txt = '\n10: Vikelis M, Rapoport AM. Role of antiepileptic drugs as preventive agents for \nmigraine. CNS Drugs. 2010 Jan 1;24(1):21-33. doi:\n10.2165/11310970-000000000-00000. Review. PubMed PMID: 20030417.\n\n\n21: Johannessen Landmark C, Larsson PG, Rytter E, Johannessen SI. Antiepileptic\ndrugs in epilepsy and other disorders--a population-based study of prescriptions.\nEpilepsy Res. 2009 Nov;87(1):31-9. Epub 2009 Aug 13. PubMed PMID: 19679449.\n\n\n'

lines = [token.replace('\n', '') for token in txt.split('.')]
for line in lines:
    print(line)

will print each element of your references line by line:

10: Vikelis M, Rapoport AM
 Role of antiepileptic drugs as preventive agents for migraine
 CNS Drugs
 2010 Jan 1;24(1):21-33
 doi:10
2165/11310970-000000000-00000
 Review
 PubMed PMID: 20030417
21: Johannessen Landmark C, Larsson PG, Rytter E, Johannessen SI
 Antiepilepticdrugs in epilepsy and other disorders--a population-based study of prescriptions
Epilepsy Res
 2009 Nov;87(1):31-9
 Epub 2009 Aug 13
 PubMed PMID: 19679449

Again, if you can trust that the first line of a record has the author, the second one the title, the third one the journal, etc., you may be able to do this very fast. If the information is a bit more "contextual", then you can START using regexes at that point.
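One way to combine the two approaches, sketched under the assumption that records are separated by blank lines (the sample data and the group names id/body/pmid are made up for illustration):

```python
import re

# Fabricated input in the same shape as the question's file
txt = ("\n10: Author A. Title one. Journal. 2010. PubMed PMID: 20030417.\n\n\n"
       "21: Author B. Title two.\nJournal Two. 2009. PubMed PMID: 19679449.\n\n\n")

# First split into records on the blank lines, then pull fields out of
# each record with a small regex:
records = [r.replace('\n', ' ') for r in txt.split('\n\n\n') if r.strip()]
pattern = re.compile(r'(?P<id>\d+): (?P<body>.+?) PubMed PMID: (?P<pmid>\d{8})\.')

parsed = []
for rec in records:
    m = pattern.search(rec)
    if m:
        parsed.append((m.group('id'), m.group('pmid')))
```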

Good luck.

Arrieta
Records are usually consistent, even if I can't be sure of this on a larger scale. Your solution is (much) faster than RegEx though, so I'll try both approaches and see the pros/cons directly :) Thanks for your help!
Gianluca Bargelli