Hi, I'm using the Scanner class in Java to go through a text file and extract each sentence. I'm calling the useDelimiter method on my Scanner with the regex:

Pattern.compile("[\\w]*[\\.|?|!][\\s]")

This currently seems to work, but it leaves the whitespace at the end of the sentence. Is there an easy way to match the whitespace at the end but not include it in the result?

I realize this is probably an easy question but I've never used regex before so go easy :)

A: 

What you're looking for is a positive lookahead. This should do it:

Pattern.compile("\\w*[.?!](?=\\s)")
WoLpH
Thanks for your help, but that didn't seem to work. My original one produced the following with two sentences (note the spaces at the end): "The quick brown fox jumps over the lazy " and "Here is another sentence that will go in the test ". Yours seemed to produce the following: "The quick brown fox jumps over the lazy " and " Here is another sentence that will go in the test ".
Gary
Just realised that the last word is also going missing, any idea why?
Gary
@WoLpH: Shouldn't that be Pattern.compile("\\w*[.?!](?=\\s)"), given that there are different semantics for expressions inside character classes as opposed to normal?
ig0774
Indeed ig0774, I'll change it.
WoLpH
@Gary: try the revised version. The original regex had a few flaws
WoLpH
+2  A: 

Try this:

"(?<=[.!?])\\s+"

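This uses a lookbehind to match \\s+ only when it is preceded by [.!?]. The punctuation itself is not part of the match, so it stays attached to the sentence.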
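As a minimal sketch of how this delimiter might be plugged into a Scanner, assuming the text is already in a String (the class name SentenceScannerSketch and the helper sentences are made up for illustration, not taken from the question):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class SentenceScannerSketch {
    // Hypothetical helper: collects the tokens a Scanner yields when the
    // delimiter is "whitespace preceded by ., ! or ?".
    static List<String> sentences(String text) {
        List<String> out = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            // The lookbehind checks the punctuation without consuming it,
            // so only the whitespace is treated as the delimiter.
            scanner.useDelimiter("(?<=[.!?])\\s+");
            while (scanner.hasNext()) {
                out.add(scanner.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy dog. "
                + "Here is another sentence that will go in the test!";
        for (String s : sentences(text)) {
            // Brackets make any stray leading or trailing whitespace visible.
            System.out.println("[" + s + "]");
        }
    }
}
```

Each token keeps its closing punctuation and has no trailing whitespace. To read from a file instead, the Scanner could be constructed from a File rather than a String.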


If you want to remove the punctuation as well, then just include it as part of the match:

"[.!?]+\\s+"

This will split "ORLY!?!? LOL" into "ORLY" and "LOL"
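A quick way to check this behaviour is with String.split, used here instead of Scanner only to keep the sketch short (the class name PunctuationSplitSketch and the helper split are illustrative assumptions):

```java
import java.util.Arrays;
import java.util.List;

class PunctuationSplitSketch {
    // Hypothetical helper: splits on one or more of ., ! or ? followed by
    // whitespace, so the punctuation is dropped along with the whitespace.
    static List<String> split(String text) {
        return Arrays.asList(text.split("[.!?]+\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(split("ORLY!?!? LOL"));   // prints [ORLY, LOL]
        // The final sentence keeps its punctuation, because no whitespace
        // follows it and the pattern therefore does not match there.
        System.out.println(split("Hi there. Bye.")); // prints [Hi there, Bye.]
    }
}
```

Note that the last sentence of the input still ends with its punctuation unless the text has trailing whitespace, so it may need to be trimmed separately.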

polygenelubricants
This only matches words, but does not stop at the end of a sentence. Thanks for trying though!
Gary
@Gary: sorry, now fixed. Try again.
polygenelubricants
That does everything but remove the period at the end! Is there an easy way to remove the period with regex, or should I just alter the string afterward? Edit: I forgot to say that I also wanted to ignore commas. Should I do this in regex or manually?
Gary
What do you mean by ignore commas? Right now this regex doesn't consider commas as sentence delimiters. Do you want it to?
polygenelubricants
Never mind, on further thought it probably isn't the job of this regex to do that. Thanks a lot for your help :)
Gary