InputString: A soldier may have bruises , wounds , marks , dislocations or other Injuries that hurt him .

ExpectedOutput:
bruises
wounds
marks
dislocations
Injuries

Generalized Pattern Tried:

    ".[\s]?(\w+?)"+                       // bruises
    "(?:(\s)?,(\s)?(\w+?))*"+             // wounds marks dislocations
    "[\s]?(?:or|and) other (\w+).";       // Injuries

The pattern should be able to match other input strings like: A soldier may have bruiser or other injuries that hurt him.

On trying the generalized pattern above, the output is: bruises dislocations Injuries

There is something wrong with the capturing group in "(?:(\s)?,(\s)?(\w+?))*". The group matches more than one occurrence, but it returns only "dislocations"; "wounds" and "marks" are swallowed.

Could you please suggest what the right pattern should be, and where the mistake is? A related question comes closest to this one, but its solution didn't help.

Thanks.

A: 

Regex is not suited for (natural) language processing. With regex, you can only match well-defined patterns. You should really, really abandon the idea of doing this with regex.

You may want to start a new question where you specify what programming language you're using to perform this task and ask for pointers there.

EDIT

PSpeed posted a promising link to a third-party library, GATE, that can handle many language processing tasks. And it's written in Java. I have not used it myself, but looking at the people/institutions working on it, it seems pretty solid.

Bart Kiers
I agree with you completely. Perl and Python may be the best when it comes to text processing, but this work is in Java. The pattern matching is a small submodule, so I need a solution to this regex problem in Java!
niks
Well, what can I say? There is really no viable way to extract these words from an input string like `A soldier may have bruiser or other injuries that hurt him` using regex. Really.
Bart Kiers
Note that you don't need Perl or Python for this. Java can do this just as well. Regex simply isn't the right tool for this job.
Bart Kiers
Thanks for this suggestion. Could you suggest a non-regex Java solution, please?
niks
http://gate.ac.uk/
PSpeed
This is a really nice tool, like UIMA (perhaps even more so). Thanks!!
niks
A: 

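The pattern that works is: \w+(?:\s*,\s*\w+)* and then you separate the comma-separated values manually. Java's regex engine keeps only the last iteration of a repeated capture group, so a single match cannot hand back every item through its groups.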
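For what it's worth, here is a minimal sketch of that "match the list, then split it" idea. The class name `ListSplitSketch`, the method `extract`, and the trailing `(?:or|and) other (\w+)` part (borrowed from the question's pattern, with word boundaries added) are my own assumptions for illustration, not something tested beyond the two sample sentences.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: names and the exact pattern are illustrative assumptions.
class ListSplitSketch {
    // Group 1: the whole comma-separated run of words.
    // Group 2: the single word after "or other" / "and other".
    private static final Pattern LIST = Pattern.compile(
            "\\b(\\w+(?:\\s*,\\s*\\w+)*)\\s*,?\\s*\\b(?:or|and) other (\\w+)");

    static List<String> extract(String input) {
        List<String> items = new ArrayList<>();
        Matcher m = LIST.matcher(input);
        if (m.find()) {
            // Split the captured run by hand instead of relying on a repeated group.
            items.addAll(Arrays.asList(m.group(1).split("\\s*,\\s*")));
            items.add(m.group(2));
        }
        return items;
    }

    public static void main(String[] args) {
        System.out.println(extract(
                "A soldier may have bruises , wounds , marks , dislocations or other Injuries that hurt him ."));
        System.out.println(extract(
                "A soldier may have bruiser or other injuries that hurt him."));
    }
}
```

This only covers sentences shaped like the two examples; anything looser runs straight into the NLP problem described in the other answers.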

Java regex is not really suitable for NLP. A useful tool for text mining is GATE: gate.ac.uk
Thanks to Bart K. , and PSpeed.

niks
A: 

When a capture group is annotated with a quantifier [i.e. (foo)*], you will only get the last match. If you want to get all of them, you need to put the quantifier inside the capture and then parse out the values manually. As big a fan as I am of regex, I don't think it's appropriate here for any number of reasons... even if you weren't ultimately doing NLP.
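A small demo of that behaviour, in case it helps to see it. The class and method names (`LastCaptureDemo`, `lastCapture`, `wholeList`) are made up for this example, and the patterns are simplified versions of the one in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class LastCaptureDemo {
    // Quantifier OUTSIDE the capture: group(1) holds only the final iteration.
    static String lastCapture(String input) {
        Matcher m = Pattern.compile("(?:\\s*,\\s*(\\w+))*").matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    // Quantifier INSIDE the capture: group(1) holds the whole repeated run,
    // which still has to be parsed by hand afterwards.
    static String wholeList(String input) {
        Matcher m = Pattern.compile("((?:\\s*,\\s*\\w+)*)").matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String tail = ", wounds , marks , dislocations";
        System.out.println(lastCapture(tail)); // prints "dislocations"
        System.out.println(wholeList(tail));   // prints ", wounds , marks , dislocations"
    }
}
```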

How to fix: (?:(\s)?,(\s)?(\w+?))*

Well, the quantifier basically covers the whole regex in that case, so you might as well use Matcher.find() to step through each match. Also, I'm curious why you have capture groups for the whitespace. If all you are trying to do is find a comma-separated set of words, then that's something like: \w+(?:\s*,\s*\w+)*. Then don't bother with capture groups and just split the whole match.
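Roughly what I mean by stepping through with Matcher.find(), as a sketch only. `FindLoopSketch` and `words` are hypothetical names, and I changed the trailing `*` to `+` here (my assumption) so that find() skips lone words like "A" and lands on the first run that actually contains a comma.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names and the '+' variant of the pattern are assumptions.
class FindLoopSketch {
    // At least two words separated by commas.
    private static final Pattern LIST = Pattern.compile("\\w+(?:\\s*,\\s*\\w+)+");
    private static final Pattern WORD = Pattern.compile("\\w+");

    static List<String> words(String input) {
        List<String> out = new ArrayList<>();
        Matcher list = LIST.matcher(input);
        if (list.find()) {
            // No capture groups needed: walk the matched run one word at a time.
            Matcher word = WORD.matcher(list.group());
            while (word.find()) {
                out.add(word.group());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words(
                "A soldier may have bruises , wounds , marks , dislocations or other Injuries that hurt him ."));
    }
}
```

Note this picks up only the comma-separated part; the word after "or other" would need its own handling.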

And for anything more complicated re: NLP, GATE is a pretty powerful tool. The learning curve is steep at times but you have a whole industry of science-guys to draw from: http://gate.ac.uk/

PSpeed