Hi all,

I'm parsing human-readable scientific text that is mostly in the field of chemistry. What I'm interested in is breaking the text into a list of words, scientific terms (more on that below), and punctuation marks.

So for example, I expect the text "hello, world." to break into 4 tokens: 1) "hello"; 2) comma; 3) "world" and 4) period. Note that spaces don't require specialized tokens.

The problem is with the "scientific terms": these are chemical names such as "1-methyl-4-phenylpyridinium". Anyone who has ever studied chemistry knows these names can get quite long and may contain numbers, dashes, commas, and sometimes even parentheses, but I think it's safe to assume these lovely expressions can't contain spaces. Also, I believe these expressions must start with a number. I would like each such expression to come out as a single token.

Today I use manual parsing to find "chunks" of text that begin with a number and end with either a space, a line break, or a punctuation mark followed by either a space or line break.
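For reference, the manual chunking described above might look roughly like the sketch below. The names find_chunks and PUNCTUATION are illustrative assumptions, not the actual code in use, and the punctuation set is a guess at what counts as a "punctuation mark".

```python
# Hypothetical punctuation set; adjust to whatever the real text contains.
PUNCTUATION = ".,;:!?"

def find_chunks(text):
    """Return chunks that begin with a digit and run to the next whitespace,
    dropping one trailing punctuation mark if present."""
    chunks = []
    # str.split() with no arguments splits on spaces and line breaks alike.
    for piece in text.split():
        if not piece[0].isdigit():
            continue
        # A punctuation mark directly before the whitespace is not part of the chunk.
        if len(piece) > 1 and piece[-1] in PUNCTUATION:
            piece = piece[:-1]
        chunks.append(piece)
    return chunks
```

This only finds the number-led chunks; ordinary words and punctuation would still need to be tokenized separately.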

I wondered if there's a smart solution (regex or other) I can use to tokenize the text according to the above specifications. I'm working in Python but this may be language agnostic.

An example input (obviously disregard the content...):

"Hello. 1-methyl-4-phenylpyridinium is ultra-bad. However, 1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine is worse."

Example output (each token on its own line):

Hello
.
1-methyl-4-phenylpyridinium
is
ultra
-
bad
.
However
,
1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine
is
worse
.
A: 

There might be a regex that parses what you want, but I don't think it will be very readable or maintainable. My advice would be to use a parser generator like ANTLR. I think you'll have to throw overboard the notion that you can make the chemical descriptions a single token; they are much too complex. ANTLR even has a debugger, so you can see why it's not parsing something you think it should. I don't think that's possible with regexps.

Regards,

Sebastiaan

Sebastiaan Megens
There are very handy tools for debugging regexes, like RegexBuddy.
Assaf Lavie
strfriend is another option if you don't want to download anything: http://strfriend.com/
Steve Losh
+2  A: 

This will solve your current example. It can be tweaked for a larger data set.

import re
# Runs of letters, digits, hyphens and commas (not ending in a comma), or a lone comma/period.
splitterForIndexing = re.compile(r"(?:[a-zA-Z0-9\-,]+[a-zA-Z0-9\-])|(?:[,.])")
source = "Hello. 1-methyl-4-phenylpyridinium is ultra-bad. However, 1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine is worse."
print("\n".join(splitterForIndexing.findall(source)))

The result is:

"""
Hello
.
1-methyl-4-phenylpyridinium
is
ultra-bad
.
However
,
1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine
is
worse
.
"""

Sorry, I didn't see "ultra-bad". If those words need to be split:

import re
# Plain words first, then longer terms that may contain digits, hyphens, commas and parentheses, then lone punctuation.
splitterForIndexing = re.compile(r"(?:[a-zA-Z]+)|(?:[a-zA-Z0-9][a-zA-Z0-9\-(),]+[a-zA-Z0-9\-()])|(?:[,.-])")
source = "Hello. 1-methyl-4-phenylpyridinium is ultra-bad. However, 1-methyl-4-phenyl-1,(2,3),6-tetrahydropyridine is worse."
print("\n".join(splitterForIndexing.findall(source)))

Gives:

"""
Hello
.
1-methyl-4-phenylpyridinium
is
ultra
-
bad
.
However
,
1-methyl-4-phenyl-1,(2,3),6-tetrahydropyridine
is
worse
.
"""
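If readability is a concern, the same three alternatives can be spelled out with re.VERBOSE and named groups, so each token also carries a label. This is only a sketch: the group names (word, term, punct) and the tokenize helper are made up for illustration, and the character classes are unchanged from the pattern above.

```python
import re

TOKEN_RE = re.compile(r"""
    (?P<word>[a-zA-Z]+)                                      # plain word
  | (?P<term>[a-zA-Z0-9][a-zA-Z0-9\-(),]+[a-zA-Z0-9\-()])    # chemical-style term
  | (?P<punct>[,.-])                                         # single punctuation mark
""", re.VERBOSE)

def tokenize(text):
    """Return (kind, token) pairs; kind is the name of the group that matched."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

if __name__ == "__main__":
    for kind, token in tokenize("ultra-bad, 1-methyl-4-phenylpyridinium."):
        print(kind, token)
```

Note that the term alternative needs at least three characters, so a bare number such as "42" is not matched by this pattern as written.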
Charles Beattie
If you need brackets: re.compile(r"(?:[a-zA-Z0-9\-(),]+[a-zA-Z0-9\-()])|(?:[,.])")
Charles Beattie
Regex is the way to go for this application: there are some out-of-the-box parsers but for extreme flexibility you will need to use regular expressions.
Andrew Sledge
A: 

I agree with Sebastiaan Megens that a regex solution may be possible, but probably not very readable or maintainable, especially if you are not already good with regular expressions. I would recommend the pyparsing module, if you're sticking with Python (which I think is a good choice).

Extra maintainability will come in very handy if your parsing needs should grow or change. (And I'm sure plenty of folks would say "when" rather than "if"! For example, someone already commented that you may need a more sophisticated notion of what needs to be allowed as a chemical name. Maybe your requirements are already changing before you've even chosen your tool!)

John Y