ansaurus

Question

[Java] use of delimiter function from scanner for "abc-def"

Answer 1

A:

This is not very simple. One thing to try would be {current-delimiter-chars}{zero-or-more-hyphens}{zero-or-more-current-delimiter-chars-or-hyphen}.

It might be easier to just ignore words returned by the scanner that consist entirely of hyphens.
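
A minimal sketch of that second idea, assuming a delimiter made of common punctuation and whitespace (the question's actual delimiter is not shown here, so DELIMITER is a guess, and the class and method names are only illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class HyphenFilter {
    // Assumed delimiter: runs of common punctuation and whitespace.
    static final String DELIMITER = "[.,:;()?!\"\\s]+";

    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter(DELIMITER);
            while (scanner.hasNext()) {
                String token = scanner.next();
                // Skip tokens made up only of hyphens, such as a free-standing dash.
                if (!token.matches("-+")) {
                    result.add(token);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [abc-def, ghi, jkl]
        System.out.println(words("abc-def - ghi, (jkl)"));
    }
}
```

The hyphen inside "abc-def" survives because the hyphen is not part of the delimiter; only the stand-alone "-" token is dropped.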

Hemal Pandya 2009-04-15 10:10:14

Answer 2

A:

If possible, try to use the predefined character classes; they make the regex much easier to read. See java.util.regex.Pattern for options.

Maybe this is what you are looking for:

string.split("\\s+(\\W*\\s)?")

Reads: Match 1 or more whitespace chars optionally followed by zero or more non-word characters and a whitespace character.
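
A small sketch of how that split behaves on a sample sentence (the class name, method name, and sample input are only illustrative):

```java
import java.util.Arrays;
import java.util.List;

class SplitDemo {
    static List<String> split(String text) {
        // Whitespace, optionally followed by non-word characters and one more whitespace character.
        return Arrays.asList(text.split("\\s+(\\W*\\s)?"));
    }

    public static void main(String[] args) {
        // prints [one, two, three, four-five]
        System.out.println(split("one two - three four-five"));
    }
}
```

Punctuation that directly follows a word, as in "ghi,", stays attached to the word, because every match of this pattern has to start with whitespace.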

CurtainDog 2009-04-15 10:14:51

I should also point out that backslashes in regex patterns need to be doubled in Java string literals; otherwise the compiler will complain that \foo isn't a valid escape sequence.

CurtainDog 2009-04-15 10:22:32

Answer 3

+1 A:

OK, I'm guessing at your question here: you mean that you have a text file with some "real" prose, i.e. sentences that actually make sense, are separated by punctuation and the like, etc., right?

Example:

This situation is ameliorated - as far as we can tell - by the fact that our most trusted allies, the Vorgons, continue to hold their poetry slam contests; the enemy has little incentive to interfere with that, even with their Mute-O-Matic devices.

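So, what you need as a delimiter is either any amount of whitespace and/or punctuation (which you already have covered with the regex you showed), or a hyphen that is surrounded by at least one whitespace character on each side. The regex character for "or" is "|". There is a shortcut for the whitespace character class (spaces, tabs, and newlines) in many regex implementations: "\s". Alternatives are tried from left to right, so the hyphen case has to come first; otherwise the punctuation class matches the space before the hyphen and the hyphen is left over as a token. Written as a Java string literal, with the backslashes doubled:

"\\s+-\\s+|[.,:;()?!\"\\s]+"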

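A sketch of such a delimiter used with Scanner, under the assumption that the input is ordinary prose (class and method names are only illustrative). Java tries alternatives left to right, so the hyphen alternative is listed first here; with the punctuation class first, the space before a free-standing hyphen would be matched on its own and "-" would come back as a token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ProseTokenizer {
    // Hyphen surrounded by whitespace, or a run of punctuation and whitespace.
    static final String DELIMITER = "\\s+-\\s+|[.,:;()?!\"\\s]+";

    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter(DELIMITER);
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [ameliorated, as, far, as, we, can, tell, by, the, Mute-O-Matic, devices]
        System.out.println(words("ameliorated - as far as we can tell - by the Mute-O-Matic devices."));
    }
}
```
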
Svante 2009-04-15 10:16:24

Answer 4

A:

Scanner scanner = new Scanner("one   two2  -   (three) four-five - ,....|");
scanner.useDelimiter("(\\B+-\\B+|[.,:;()?!\" \t|])+");

while (scanner.hasNext()) {
    System.out.println(scanner.next("\\w+(-\\w+)*"));
}

NB

The next(String) call checks that each returned token is a word, since the original useDelimiter() pattern misses "|".

NB

You have used the regular expression "\r\n|\n" as the line terminator. The JavaDocs for java.util.regex.Pattern show other possible line terminators, so a more complete check would use the expression "\r\n|[\r\n\u2028\u2029\u0085]".

dfa 2009-04-15 10:59:30

\B is a zero-width assertion; it matches a position that is not a word boundary. It doesn't consume any characters, so it makes no sense to add a '+' or any other quantifier to it. Java just ignores the quantifier, but some other regex flavors treat it as a syntax error.
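
To illustrate the zero-width behavior, here is a small sketch using the pattern without the quantifiers (the class and method names are only illustrative). The match is the hyphen alone, because \B only tests the positions on either side of it:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonBoundaryDemo {
    // A hyphen with no word boundary directly before or after it.
    static final Pattern FREE_HYPHEN = Pattern.compile("\\B-\\B");

    // Returns the matched text, or null when the pattern is not found.
    static String firstMatch(String text) {
        Matcher matcher = FREE_HYPHEN.matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("two - three")); // prints -
        System.out.println(firstMatch("four-five"));   // prints null
    }
}
```

In "four-five" there is a word boundary on both sides of the hyphen, so \B fails and the hyphenated word is left intact.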

Alan Moore 2009-04-15 17:26:28

Also, the OP isn't using "\r\n|\n". He isn't interested in line separators at all. He's just matching the most common whitespace characters along with the punctuation characters in the character class (but he should be using "\s", like @Svante did).

Alan Moore 2009-04-15 17:37:47

He used \r\n in useDelimiter(). By the way, thanks for the first clarification! :)

dfa 2009-04-15 17:51:42

Answer 5

A:

This should be simple enough: [^\\w-]\\W*|-\\W+

But of course if it's prose, and you want to exclude underscores:
[^\\p{Alnum}-]\\P{Alnum}*|-\\P{Alnum}+
or if you don't expect numerics:
[^\\p{Alpha}-]\\P{Alpha}*|-\\P{Alpha}+

EDIT: These are the easier forms. Keep in mind that a complete solution, one that also handles dashes at the beginning and end of lines, would follow this pattern: (?:^|[^\\w-])\\W*|-(?:\\W+|$)
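
A sketch of the first, simplest pattern above used as a Scanner delimiter (the class name, method name, and sample input are only illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class WordDelimiterDemo {
    // A non-word, non-hyphen character plus any following non-word characters,
    // or a hyphen followed by at least one non-word character.
    static final String DELIMITER = "[^\\w-]\\W*|-\\W+";

    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter(DELIMITER);
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [one, two, three, four-five, six]
        System.out.println(words("one two - (three) four-five, six"));
    }
}
```

The hyphen in "four-five" is kept because it is followed by a word character, so neither alternative can match there.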

Axeman 2009-04-15 16:11:53
