Hi all,
I'm parsing human-readable scientific text, mostly from the field of chemistry. I want to break the text into a list of words, scientific terms (more on those below), and punctuation marks.
For example, I expect the text "hello, world." to break into 4 tokens: 1) "hello"; 2) comma; 3) "world"; and 4) period. Note that spaces don't need tokens of their own.
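Ignoring the scientific terms for a moment, this basic split can be sketched with Python's `re` module. This is only a rough illustration: the function name `basic_tokenize` is made up here, and it assumes that any single character that is neither a word character nor whitespace counts as a punctuation token.

```python
import re

def basic_tokenize(text):
    # \w+     : a run of letters, digits or underscores (a "word")
    # [^\w\s] : one character that is neither a word character nor whitespace
    # Whitespace matches neither alternative, so it never becomes a token.
    return re.findall(r"\w+|[^\w\s]", text)

print(basic_tokenize("hello, world."))  # ['hello', ',', 'world', '.']
```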
The problem is with the "scientific terms": these are names of chemical compounds, such as "1-methyl-4-phenylpyridinium". Anyone who has ever studied chemistry knows these names can get quite long and may contain numbers, dashes and commas, and sometimes even parentheses, but I think it's safe to assume these lovely expressions can't contain spaces. I'm also assuming these expressions always start with a number. I would like each such expression to come out as a single token.
At the moment I parse the text manually, looking for "chunks" that begin with a number and end with either a space, a line break, or a punctuation mark followed by a space or line break.
I'm wondering whether there's a smarter solution (regex or otherwise) I can use to tokenize the text according to the above specification. I'm working in Python, but the question may be language-agnostic.
An example input (obviously disregard the content...):
"Hello. 1-methyl-4-phenylpyridinium is ultra-bad. However, 1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine is worse."
Example output (each token on its own line):
Hello
.
1-methyl-4-phenylpyridinium
is
ultra
-
bad
.
However
,
1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine
is
worse
.
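
For reference, the rule described above (a chunk that begins with a number and runs until whitespace, or until a punctuation mark that is followed by whitespace) can be written as one regular expression. This is a sketch under the assumptions stated in the question, not a tested solution for chemical nomenclature in general; the names `TOKEN_RE` and `tokenize` are made up for illustration.

```python
import re

TOKEN_RE = re.compile(
    r"""
      \d\S*?(?=[^\w\s]?(?:\s|$))  # term: starts with a digit, grows lazily until
                                  # whitespace or end of text follows, optionally
                                  # after one trailing punctuation mark
    | \w+                         # ordinary word
    | [^\w\s]                     # single punctuation mark
    """,
    re.VERBOSE,
)

def tokenize(text):
    # The pattern has no capturing groups, so findall returns the full matches.
    return TOKEN_RE.findall(text)

sample = ("Hello. 1-methyl-4-phenylpyridinium is ultra-bad. However, "
          "1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine is worse.")
for token in tokenize(sample):
    print(token)
```

The term alternative is listed first so that it takes priority over the plain word pattern whenever a token starts with a digit. One known limitation of this sketch: a name that itself ends in a punctuation character, such as a closing parenthesis, loses that character to a separate token when whitespace follows.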