Hi!

I'm currently trying to filter a text file that contains words separated by a "-". I want to count the words.

scanner.useDelimiter(("[.,:;()?!\" \t\n\r]+"));

The problem is simple: words that contain a "-" get split and are counted as two words. So just adding an escaped \- to the delimiter isn't the solution of choice.

How can I change the delimiter-expression, so that words like "foo-bar" will stay, but the "-" alone will be filtered out and ignored?

Thanks ;)

A: 

This is not very simple. One thing to try would be {current-delimiter-chars}{zero-or-more-hyphens}{zero-or-more-current-delimiter-chars-or-hyphen}.

It might be easier to just ignore any words returned by the scanner that consist entirely of hyphens.
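
A minimal sketch of that second suggestion, assuming the delimiter from the question is kept unchanged. The class and method names (HyphenTokenFilter, countWords) are made up for illustration. Tokens that consist only of hyphens are skipped when counting:

```java
import java.util.Scanner;

class HyphenTokenFilter {
    // Counts tokens, skipping any token that consists only of hyphens.
    static int countWords(String text) {
        int count = 0;
        try (Scanner scanner = new Scanner(text)) {
            // Delimiter copied from the question; the hyphen is deliberately not part of it.
            scanner.useDelimiter("[.,:;()?!\" \t\n\r]+");
            while (scanner.hasNext()) {
                String token = scanner.next();
                // "-+" matches a token made of one or more hyphens and nothing else.
                if (!token.matches("-+")) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // "foo-bar", "baz", "qux" and "end" are counted; "-" and "--" are skipped.
        System.out.println(countWords("foo-bar - baz, qux -- end.")); // prints 4
    }
}
```

This keeps the regex untouched and moves the special case into plain Java, which may be easier to maintain than a more elaborate delimiter.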

Hemal Pandya
A: 

If possible, try to use the predefined character classes; they make the regex much easier to read. See java.util.regex.Pattern for the options.

Maybe this is what you are looking for:

string.split("\\s+(\\W*\\s)?")

Reads: Match 1 or more whitespace chars optionally followed by zero or more non-word characters and a whitespace character.
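
A small sketch of how this split behaves, with an illustrative class name (SplitOnWhitespace) and sample text that are not from the question. Note that the backslashes are doubled because the pattern is written as a Java string literal, and that punctuation directly attached to a word (as in "Hello,") stays part of the token, which does not affect the word count:

```java
import java.util.Arrays;

class SplitOnWhitespace {
    static String[] words(String text) {
        // trim() avoids an empty first element when the text starts with whitespace.
        // The optional group swallows a run of non-word characters (such as a lone
        // hyphen) that sits between two whitespace characters.
        return text.trim().split("\\s+(\\W*\\s)?");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(words("one two - three-four")));
        // prints [one, two, three-four]
    }
}
```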

CurtainDog
I should also point out that backslashes in regex patterns need to be doubled in Java string literals; otherwise the compiler will complain that \foo isn't a valid escape sequence.
CurtainDog
+1  A: 

OK, I'm guessing at your question here: you mean that you have a text file with some "real" prose, i.e. sentences that actually make sense, are separated by punctuation and the like, etc., right?

Example:

This situation is ameliorated - as far as we can tell - by the fact that our most trusted allies, the Vorgons, continue to hold their poetry slam contests; the enemy has little incentive to interfere with that, even with their Mute-O-Matic devices.

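So, what you need as a delimiter is something that is either any amount of whitespace and/or punctuation (which you already have covered with the regex you showed), or a hyphen that is surrounded by at least one whitespace character on each side. The regex character for "or" is "|". There is a shortcut for the whitespace character class (spaces, tabs, and newlines) in many regex implementations: "\s" (written "\\s" in a Java string literal). The hyphen alternative goes first, because Java tries the alternatives from left to right and the character class would otherwise consume the leading whitespace on its own, leaving the "-" behind as a token.

"\\s+-\\s+|[.,:;()?!\"\\s]+"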
Svante
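
A minimal sketch of the answer above as a Scanner delimiter, with hypothetical names (ProseWordScanner, words, DELIMITER). The whitespace-surrounded hyphen is listed as the first alternative, since Java tries alternatives from left to right:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ProseWordScanner {
    // A hyphen with whitespace on both sides, or a run of punctuation/whitespace.
    static final String DELIMITER = "\\s+-\\s+|[.,:;()?!\"\\s]+";

    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter(DELIMITER);
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> found = words("as far as we can tell - by their Mute-O-Matic devices.");
        // "Mute-O-Matic" stays one token; the free-standing "-" is treated as a delimiter.
        System.out.println(found.size() + " words: " + found);
    }
}
```

One limitation of this sketch: a hyphen that directly follows punctuation, as in "foo, - bar", is still returned as a token, because the punctuation class consumes the whitespace before the hyphen.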
A: 
Scanner scanner = new Scanner("one   two2  -   (three) four-five - ,....|");
scanner.useDelimiter("(\\B+-\\B+|[.,:;()?!\" \t|])+");

while (scanner.hasNext()) {
    System.out.println(scanner.next("\\w+(-\\w+)*"));
}

NB

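The next(String) method makes sure that you get only words: it throws an InputMismatchException if the next token does not match the given pattern. This matters because the original useDelimiter() call does not treat "|" as a delimiter.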

NB

You have used the regular expression "\r\n|\n" as the line terminator. The JavaDocs for java.util.regex.Pattern list other possible line terminators, so a more complete check would use the expression "\r\n|[\r\n\u2028\u2029\u0085]".

dfa
\B is a zero-width assertion; it matches a position that is not a word boundary. It doesn't consume any characters, so it makes no sense to add a '+' or any other quantifier to it. Java just ignores the quantifier, but some other regex flavors treat it as a syntax error.
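
To illustrate the point about \B, here is a small sketch with a made-up class name (NonBoundaryDemo) and sample input. A hyphen between two word characters sits on word boundaries, so \B fails there; a free-standing hyphen has non-word characters on both sides, so \B matches before and after it:

```java
import java.util.Arrays;

class NonBoundaryDemo {
    static String[] splitOnLoneHyphens(String text) {
        // \B needs no quantifier: it only asserts "not a word boundary" at a position.
        // Optional whitespace around the hyphen is consumed as part of the delimiter.
        return text.split("\\s*\\B-\\B\\s*");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnLoneHyphens("four-five - six")));
        // prints [four-five, six]
    }
}
```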
Alan Moore
Also, the OP isn't using "\r\n|\n". He isn't interested in line separators at all. He's just matching the most common whitespace characters along with the punctuation characters in the character class (but he should be using "\s", like @Svante did).
Alan Moore
He used \r\n in useDelimiter(). By the way, thanks for the first clarification! :)
dfa
A: 

This should be simple enough: [^\\w-]\\W*|-\\W+

  • But of course if it's prose, and you want to exclude underscores:
    [^\\p{Alnum}-]\\P{Alnum}*|-\\P{Alnum}+
  • or if you don't expect numerics:
    [^\\p{Alpha}-]\\P{Alpha}*|-\\P{Alpha}+

EDIT: These are the easier forms. Keep in mind that a complete solution, one that also handles dashes at the very beginning and end of the text, would follow this pattern: (?:^|[^\\w-])\\W*|-(?:\\W+|$)
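
A minimal sketch of the first, simpler pattern from this answer used as a Scanner delimiter. The class and method names (DashAwareTokenizer, tokens) and the sample text are assumptions for illustration. As the EDIT notes, this form does not handle a dash at the very start or end of the input:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class DashAwareTokenizer {
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            // Either a non-word, non-hyphen character followed by any non-word run,
            // or a hyphen followed by at least one non-word character.
            scanner.useDelimiter("[^\\w-]\\W*|-\\W+");
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("one, two - three-four!"));
        // prints [one, two, three-four]
    }
}
```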

Axeman