ansaurus

Question

Python: question about parsing human-readable text

Answer 1

A:

There might be a regex that parses what you want, but I don't think it would be very readable or maintainable. My advice would be to use a parser generator like ANTLR. I think you'll have to abandon the idea of treating each chemical description as a single token; they are much too complex for that. ANTLR even has a debugger, so you can see why it isn't parsing something you think it should. I don't think that's possible with regexps.

Regards,

Sebastiaan

Sebastiaan Megens 2009-07-20 12:19:11

There are very handy tools for debugging regexes, like RegexBuddy.

Assaf Lavie 2009-07-20 12:29:42

strfriend is another option if you don't want to download anything: http://strfriend.com/

Steve Losh 2009-07-20 13:25:59

Answer 2

+2 A:

This will solve your current example. It can be tweaked for a larger data set.

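import re
# A token is either a run of letters, digits, hyphens and commas that does not
# end in a comma, or a lone comma or period.
splitterForIndexing = re.compile(r"(?:[a-zA-Z0-9\-,]+[a-zA-Z0-9\-])|(?:[,.])")
source = "Hello. 1-methyl-4-phenylpyridinium is ultra-bad. However, 1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine is worse."
# print() with parentheses runs on both Python 2 and Python 3.
print("\n".join(splitterForIndexing.findall(source)))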

The result is:

"""
Hello
.
1-methyl-4-phenylpyridinium
is
ultra-bad
.
However
,
1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine
is
worse
.
"""

Sorry, I didn't see ultra-bad. If words like that need to be split at the hyphen:

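import re
# Alternatives are tried in order: a plain word of letters only, then a compound
# name (letters, digits, hyphens, commas, parentheses), then a lone , . or -
splitterForIndexing = re.compile(r"(?:[a-zA-Z]+)|(?:[a-zA-Z0-9][a-zA-Z0-9\-(),]+[a-zA-Z0-9\-()])|(?:[,.-])")
source = "Hello. 1-methyl-4-phenylpyridinium is ultra-bad. However, 1-methyl-4-phenyl-1,(2,3),6-tetrahydropyridine is worse."
# print() with parentheses runs on both Python 2 and Python 3.
print("\n".join(splitterForIndexing.findall(source)))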

Gives:

"""
Hello
.
1-methyl-4-phenylpyridinium
is
ultra
-
bad
.
However
,
1-methyl-4-phenyl-1,(2,3),6-tetrahydropyridine
is
worse
.
"""

Charles Beattie 2009-07-20 13:06:02
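
For readers on Python 3, here is a minimal sketch of the second pattern above, written with re.VERBOSE so each alternative carries its own comment. The names TOKEN_RE and tokenize are illustrative and not part of the original answer; the three alternatives are the same ones, only without the non-capturing groups.

```python
import re

# The same three alternatives as the second pattern, one per line.
TOKEN_RE = re.compile(r"""
    [a-zA-Z]+                 # a plain word made only of letters
  | [a-zA-Z0-9]               # compound name: starts with a letter or digit,
    [a-zA-Z0-9\-(),]+         #   continues with letters, digits, - , ( )
    [a-zA-Z0-9\-()]           #   and must not end on a comma
  | [,.-]                     # a single comma, period or hyphen
""", re.VERBOSE)


def tokenize(text):
    """Return the tokens found in text, in order of appearance."""
    return TOKEN_RE.findall(text)


source = ("Hello. 1-methyl-4-phenylpyridinium is ultra-bad. However, "
          "1-methyl-4-phenyl-1,(2,3),6-tetrahydropyridine is worse.")
tokens = tokenize(source)
print("\n".join(tokens))
```

Two limits of these rules are worth knowing before applying them to a larger data set. A name that begins with a letter, such as methyl-4-phenylpyridinium, comes back as methyl, -, 4-phenylpyridinium, because the plain-word alternative is tried first. A digit standing alone matches none of the alternatives and is silently dropped.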

If you need brackets: re.compile(r"(?:[a-zA-Z0-9\-(),]+[a-zA-Z0-9\-()])|(?:[,.])")

Charles Beattie 2009-07-20 13:09:02

Regex is the way to go for this application. There are some out-of-the-box parsers, but for extreme flexibility you will need to use regular expressions.

Andrew Sledge 2009-07-20 13:09:23

Answer 3

A:

I agree with Sebastiaan Megens that a regex solution may be possible, but probably not very readable or maintainable, especially if you are not already good with regular expressions. I would recommend the pyparsing module, if you're sticking with Python (which I think is a good choice).

Extra maintainability will come in very handy if your parsing needs should grow or change. (And I'm sure plenty of folks would say "when" rather than "if"! For example, someone already commented that you may need a more sophisticated notion of what needs to be allowed as a chemical name. Maybe your requirements are already changing before you've even chosen your tool!)
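
To make the maintainability point concrete, here is a rough sketch that uses only the standard library. It is not pyparsing. It keeps each rule as a named entry in a list, so that allowing a new kind of token is a one-line change. The kind names (WORD, COMPOUND, PUNCT) and the helper names are assumptions made for illustration; the patterns are borrowed from the regex answer on this page.

```python
import re

# Hypothetical rule table: (kind, pattern). Order matters, earlier rules win.
TOKEN_SPEC = [
    ("WORD", r"[a-zA-Z]+"),
    ("COMPOUND", r"[a-zA-Z0-9][a-zA-Z0-9\-(),]+[a-zA-Z0-9\-()]"),
    ("PUNCT", r"[,.-]"),
]

# Wrap every rule in a named group and join them into one alternation.
LABELLED_RE = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))


def labelled_tokens(text):
    """Yield (kind, token) pairs; kind is the name of the rule that matched."""
    for match in LABELLED_RE.finditer(text):
        yield match.lastgroup, match.group()


sample = "However, 1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine is worse."
labelled = list(labelled_tokens(sample))
for kind, token in labelled:
    print(kind, token)
```

Because every token carries the name of the rule that produced it, a later indexing step can treat compound names differently from ordinary words without re-parsing the text.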

John Y 2009-07-21 02:51:46
