tags:

views:

189

answers:

5

Hi,

I have some data which look like that:

PMID- 19587274
OWN - NLM
DP  - 2009 Jul 8
TI  - Domain general mechanisms of perceptual decision making in human cortex.
PG  - 8675-87
AB  - To successfully interact with objects in the environment, sensory evidence must
      be continuously acquired, interpreted, and used to guide appropriate motor
      responses. For example, when driving, a red 
AD  - Perception and Cognition Laboratory, Department of Psychology, University of
      California, San Diego, La Jolla, California 92093, USA.

PMID- 19583148
OWN - NLM
DP  - 2009 Jun
TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic
      amyloidosis.
PG  - 482-6
AB  - BACKGROUND: Amyloidosis represents a group of different diseases characterized by
      extracellular accumulation of pathologic fibrillar proteins in various tissues
AD  - Asklepios Hospital, Department of Medicine, Langen, Germany.
      [email protected]

I want to write a regex which can match the sentences which follow PMID, TI and AB.

Is it possible to get these in a one shot regex?

I have spent nearly the whole day to try to figure out a regex and the closest I could get is that:

reg4 = r'PMID- (?P<pmid>[0-9]*).*TI.*- (?P<title>.*)PG.*AB.*- (?P<abstract>.*)AD'
for i in re.finditer(reg4, data, re.S | re.M): print i.groupdict()

Which will return me the matches only in the second "set" of data, and not all of them.

Any idea? Thank you!

+1  A: 

How about:

import re
reg4 = re.compile(r'^(?:PMID- (?P<pmid>[0-9]+)|TI  - (?P<title>.*?)^PG|AB  - (?P<abstract>.*?)^AD)', re.MULTILINE | re.DOTALL)
for i in reg4.finditer(data):
    print i.groupdict()

Output:

{'pmid': '19587274', 'abstract': None, 'title': None}
{'pmid': None, 'abstract': None, 'title': 'Domain general mechanisms of perceptual decision making in human cortex.\n'}
{'pmid': None, 'abstract': 'To successfully interact with objects in the environment, sensory evidence must\n      be continuously acquired, interpreted, and used to guide appropriate motor\n      responses. For example, when driving, a red \n', 'title': None}
{'pmid': '19583148', 'abstract': None, 'title': None}
{'pmid': None, 'abstract': None, 'title': 'Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n      amyloidosis.\n'}
{'pmid': None, 'abstract': 'BACKGROUND: Amyloidosis represents a group of different diseases characterized by\n      extracellular accumulation of pathologic fibrillar proteins in various tissues\n', 'title': None}

Edit

As a verbose RE to make it more understandable (I think verbose REs should be used for anything but the simplest of expressions, but that's just my opinion!):

#!/usr/bin/python
import re
reg4 = re.compile(r'''
        ^                     # Start of a line (due to re.MULTILINE, this may match at the start of any line)
        (?:                   # Non capturing group with multiple options, first option:
            PMID-\s           # Literal "PMID-" followed by a space
            (?P<pmid>[0-9]+)  # Then a string of one or more digits, group as 'pmid'
        |                     # Next option:
            TI\s{2}-\s        # "TI", two spaces, a hyphen and a space
            (?P<title>.*?)    # The title, a non greedy match that will capture everything up to...
            ^PG               # The characters PG at the start of a line
        |                     # Next option
            AB\s{2}-\s        # "AB  - "
            (?P<abstract>.*?) # The abstract, a non greedy match that will capture everything up to...
            ^AD               # "AD" at the start of a line
        )
        ''', re.MULTILINE | re.DOTALL | re.VERBOSE)
for i in reg4.finditer(data):
    print i.groupdict()

Note that you could replace the ^PG and ^AD with ^\S to make it more general (you want to match everything up until the first non-space at the start of a line).

Edit 2

If you want to catch the whole thing in one regexp, get rid of the starting (?:, the ending ) and change the | characters to .*?:

#!/usr/bin/python
import re
reg4 = re.compile(r'''
        ^                 # Start of a line (due to re.MULTILINE, this may match at the start of any line)
        PMID-\s           # Literal "PMID-" followed by a space
        (?P<pmid>[0-9]+)  # Then a string of one or more digits, group as 'pmid'
        .*?               # Next part:
        TI\s{2}-\s        # "TI", two spaces, a hyphen and a space
        (?P<title>.*?)    # The title, a non greedy match that will capture everything up to...
        ^PG               # The characters PG at the start of a line
        .*?               # Next option
        AB\s{2}-\s        # "AB  - "
        (?P<abstract>.*?) # The abstract, a non greedy match that will capture everything up to...
        ^AD               # "AD" at the start of a line
        ''', re.MULTILINE | re.DOTALL | re.VERBOSE)
for i in reg4.finditer(data):
    print i.groupdict()

This gives:

{'pmid': '19587274', 'abstract': 'To successfully interact with objects in the environment, sensory evidence must\n      be continuously acquired, interpreted, and used to guide appropriate motor\n      responses. For example, when driving, a red \n', 'title': 'Domain general mechanisms of perceptual decision making in human cortex.\n'}
{'pmid': '19583148', 'abstract': 'BACKGROUND: Amyloidosis represents a group of different diseases characterized by\n      extracellular accumulation of pathologic fibrillar proteins in various tissues\n', 'title': 'Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n      amyloidosis.\n'}
Al
Just to add, one of the issues with your original regex was probably too much usage of the greedy `.*` pattern, e-Jah - it was matching too much, and thus "greedily eating up" all of the records before the last as part of that greedy match, so you'd actually get the PMID of the first entry matched along with the abstract/title of the last entry (and all the rest would have gotten eaten in the very first match's first `.*` pattern).
Amber
A: 

The problem were the greedy qualifiers. Here's a regex that is more specific, and non-greedy:

#!/usr/bin/python
import re
from pprint import pprint
data = open("testdata.txt").read()

reg4 = r'''
   ^PMID               # Start matching at the string PMID
   \s*?-               # As little whitespace as possible up to the next '-'
   \s*?                # As little whitespcase as possible
   (?P<pmid>[0-9]+)    # Capture the field "pmid", accepting only numeric characters
   .*?TI               # next, match any character up to the first occurrence of 'TI'
   \s*?-               # as little whitespace as possible up to the next '-'
   \s*?                # as little whitespace as possible
   (?P<title>.*?)PG    # capture the field <title> accepting any character up the the next occurrence of 'PG'
   .*?AB               # match any character up to the following occurrence of 'AB'
   \s*?-               # As little whitespace as possible up to the next '-'
   \s*?                # As little whitespcase as possible
   (?P<abstract>.*?)AD # capture the fiels <abstract> accepting any character up to the next occurrence of 'AD'
'''
for i in re.finditer(reg4, data, re.S | re.M | re.VERBOSE):
   print 78*"-"
   pprint(i.groupdict())

Output:

------------------------------------------------------------------------------
{'abstract': ' To successfully interact with objects in the environment,
   sensory evidence must\n      be continuously acquired, interpreted, and
   used to guide appropriate motor\n      responses. For example, when
   driving, a red \n',
 'pmid': '19587274',
 'title': ' Domain general mechanisms of perceptual decision making in
    human cortex.\n'}
------------------------------------------------------------------------------
{'abstract': ' BACKGROUND: Amyloidosis represents a group of different
   diseases characterized by\n      extracellular accumulation of pathologic
   fibrillar proteins in various tissues\n',
 'pmid': '19583148',
 'title': ' Ursodeoxycholic acid for treatment of cholestasis in patients
    with hepatic\n      amyloidosis.\n'}

You may want to strip the whitespace of each field after scanning.

exhuma
Just one point: this regexp will have problems if there is the text `PG` somewhere in the title or `AD` in the abstract. Adding the `^` start of line qualifier will fix this.
Al
Thanks @Al. Fixed it.
exhuma
A: 

Another regex:

reg4 = r'(?<=PMID- )(?P<pmid>.*?)(?=OWN - ).*?(?<=TI  - )(?P<title>.*?)(?=PG  - ).*?(?<=AB  - )(?P<abstract>.*?)(?=AD  - )'
Nick D
+2  A: 

How about not using regexps for this task, but instead using programmatic code that splits by newlines, looks at prefix codes using .startswith() etc? The code would be longer that way but everyone would be able to understand it, without having to come to stackoverflow for help.

Pēteris Caune
Having already answered the question with a long regular expression, I have to agree with Pēteris Caune: the `.startswith()` code style can end up being a little messy, but compared to the kind of complexity you need in a regular expression, it's way better. It's also far easier to understand. You may also find some existing parsers on the web that'll do the work for you...
Al
A: 

If the order of the lines can change, you could use this pattern:

reg4 = re.compile(r"""
    ^
    (?: PMID \s*-\s* (?P<pmid> [0-9]+ ) \n
     |  TI   \s*-\s* (?P<title> .* (?:\n[^\S\n].*)* ) \n
     |  AB   \s*-\s* (?P<abstract> .* (?:\n[^\S\n].*)* ) \n
     |  .+\n
     )+
""", re.MULTILINE | re.VERBOSE)

It will match consecutive non-empty lines, and capture the PMID, TI and AB items.

The item values can span multiple lines, as long as the lines following the first start with a whitespace character.

  • "[^\S\n]" matches any whitespace character (\s), except newline (\n).
  • ".* (?:\n[^\S\n].*)*" matches consecutive lines that start with a whitespace character.
  • ".+\n" matches any other non-empty line.
MizardX