How about:
import re
reg4 = re.compile(r'^(?:PMID- (?P<pmid>[0-9]+)|TI - (?P<title>.*?)^PG|AB - (?P<abstract>.*?)^AD)', re.MULTILINE | re.DOTALL)
for i in reg4.finditer(data):
print i.groupdict()
Output:
{'pmid': '19587274', 'abstract': None, 'title': None}
{'pmid': None, 'abstract': None, 'title': 'Domain general mechanisms of perceptual decision making in human cortex.\n'}
{'pmid': None, 'abstract': 'To successfully interact with objects in the environment, sensory evidence must\n be continuously acquired, interpreted, and used to guide appropriate motor\n responses. For example, when driving, a red \n', 'title': None}
{'pmid': '19583148', 'abstract': None, 'title': None}
{'pmid': None, 'abstract': None, 'title': 'Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n amyloidosis.\n'}
{'pmid': None, 'abstract': 'BACKGROUND: Amyloidosis represents a group of different diseases characterized by\n extracellular accumulation of pathologic fibrillar proteins in various tissues\n', 'title': None}
Edit
As a verbose RE to make it more understandable (I think verbose REs should be used for anything but the simplest of expressions, but that's just my opinion!):
#!/usr/bin/python
import re
reg4 = re.compile(r'''
^ # Start of a line (due to re.MULTILINE, this may match at the start of any line)
(?: # Non capturing group with multiple options, first option:
PMID-\s # Literal "PMID-" followed by a space
(?P<pmid>[0-9]+) # Then a string of one or more digits, group as 'pmid'
| # Next option:
TI\s{2}-\s # "TI", two spaces, a hyphen and a space
(?P<title>.*?) # The title, a non greedy match that will capture everything up to...
^PG # The characters PG at the start of a line
| # Next option
AB\s{2}-\s # "AB - "
(?P<abstract>.*?) # The abstract, a non greedy match that will capture everything up to...
^AD # "AD" at the start of a line
)
''', re.MULTILINE | re.DOTALL | re.VERBOSE)
for i in reg4.finditer(data):
print i.groupdict()
Note that you could replace the ^PG
and ^AD
with ^\S
to make it more general (you want to match everything up until the first non-space at the start of a line).
Edit 2
If you want to catch the whole thing in one regexp, get rid of the starting (?:
, the ending )
and change the |
characters to .*?
:
#!/usr/bin/python
import re
reg4 = re.compile(r'''
^ # Start of a line (due to re.MULTILINE, this may match at the start of any line)
PMID-\s # Literal "PMID-" followed by a space
(?P<pmid>[0-9]+) # Then a string of one or more digits, group as 'pmid'
.*? # Next part:
TI\s{2}-\s # "TI", two spaces, a hyphen and a space
(?P<title>.*?) # The title, a non greedy match that will capture everything up to...
^PG # The characters PG at the start of a line
.*? # Next option
AB\s{2}-\s # "AB - "
(?P<abstract>.*?) # The abstract, a non greedy match that will capture everything up to...
^AD # "AD" at the start of a line
''', re.MULTILINE | re.DOTALL | re.VERBOSE)
for i in reg4.finditer(data):
print i.groupdict()
This gives:
{'pmid': '19587274', 'abstract': 'To successfully interact with objects in the environment, sensory evidence must\n be continuously acquired, interpreted, and used to guide appropriate motor\n responses. For example, when driving, a red \n', 'title': 'Domain general mechanisms of perceptual decision making in human cortex.\n'}
{'pmid': '19583148', 'abstract': 'BACKGROUND: Amyloidosis represents a group of different diseases characterized by\n extracellular accumulation of pathologic fibrillar proteins in various tissues\n', 'title': 'Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n amyloidosis.\n'}