Well, I'm looking for a regexp in Java that deletes all words shorter than 3 characters. I thought something like \s\w{1,2}\s would grab all the 1 and 2 letter words (a whitespace, one to two word characters and another whitespace), but it just doesn't work. Where am I wrong?

+1  A: 

Try \b\w{1,2}\b, although you will still have to get rid of the double spaces that it leaves behind.
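
To see why the whitespace-delimited pattern from the question misses words, here is a small sketch comparing the two patterns. The class and method names are made up for illustration, and I'm assuming the question's pattern was used with a single space as the replacement, which the question doesn't actually say.

```java
class ShortWordPatterns {
    // Pattern from the question: needs a whitespace character on BOTH sides,
    // so it skips words at the start or end of the string and words that
    // share a space with a match that was just consumed.
    // Assumption: a single space is used as the replacement.
    static String withWhitespace(String s) {
        return s.replaceAll("\\s\\w{1,2}\\s", " ");
    }

    // Pattern from this answer: \b is zero-width, so no whitespace is consumed.
    static String withBoundaries(String s) {
        return s.replaceAll("\\b\\w{1,2}\\b", "");
    }

    public static void main(String[] args) {
        String text = "It is a dog";
        System.out.println("[" + withWhitespace(text) + "]");  // prints "[It a dog]"
        System.out.println("[" + withBoundaries(text) + "]");  // prints "[   dog]"
    }
}
```

With \s on both sides, "It" survives because nothing precedes it, and "a" survives because the space before it was already consumed by the match on " is ". The \b version removes all three short words but leaves their spaces behind.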

GameFreak
+2  A: 

If you don't want the whitespace matched, you might want to use

\b\w{1,2}\b

to get the word boundaries.

That's working for me in RegexBuddy using the Java flavor; for the test string

"The dog is fun a cat"

it highlights "is" and "a". Similarly for words at the beginning/end of a line.

You might want to post a code sample.

(And, as GameFreak just posted, you'll still end up with double spaces.)

EDIT:

\b\w{1,2}\b\s?

is another option. This will partially fix the space-stripping issue, although words at the end of a string or followed by punctuation can still cause issues. For example, "A dog is fun no?" becomes "dog fun ?" In any case, you're still going to have issues with capitalization (dog should now be Dog).
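
A quick sketch of that last pattern in Java, in case it helps (the class and method names are just placeholders I picked):

```java
class OptionalSpaceDemo {
    // Removes each one- or two-character word together with at most one
    // whitespace character that directly follows it.
    static String strip(String s) {
        return s.replaceAll("\\b\\w{1,2}\\b\\s?", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("The dog is fun a cat")); // prints "The dog fun cat"
        System.out.println(strip("A dog is fun no?"));     // prints "dog fun ?"
    }
}
```

The second line shows the punctuation problem: "no" is removed, but the space in front of it stays because only trailing whitespace is consumed.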

TrueWill
Please leave a comment when downvoting. I was composing this at the same time GameFreak posted his answer; I did not simply copy it.
TrueWill
I modded you up.
janesconference
Thank you @janesconference! :)
TrueWill
If you slightly changed that to \b\w{1,2}\s?\b, that might address the whitespace issue.
CaptainAwesomePants
@CaptainAwesomePants: Synchronicity - I was just editing my answer. :) I don't think you need the final \b, though.
TrueWill
+1  A: 

If you have a string like this:

hello there my this is a short word

This regex will match all words in the string greater than or equal to 3 characters in length:

\w{3,}

Resulting in:

hello there this short word

That, to me, is the easiest approach. Why try to match what you don't want when you can match what you do want much more easily? No double spaces, no leftovers, and the punctuation is under your control. The other approaches break on multiple spaces and aren't very robust.
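
In plain Java this could look roughly like the sketch below, using java.util.regex to collect the matches and join them with single spaces. The class and method names are made up, and String.join assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeepLongWords {
    // Matches runs of three or more word characters.
    private static final Pattern LONG_WORD = Pattern.compile("\\w{3,}");

    // Collects every match and joins the matches with single spaces.
    // Anything that is not part of a match (short words, punctuation,
    // extra whitespace) is dropped.
    static String keepLongWords(String text) {
        List<String> kept = new ArrayList<>();
        Matcher m = LONG_WORD.matcher(text);
        while (m.find()) {
            kept.add(m.group());
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        // prints "hello there this short word"
        System.out.println(keepLongWords("hello there my this is a short word"));
    }
}
```

Note that this version throws the punctuation away entirely; if you want to keep some of it, the pattern would have to be widened accordingly.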

Jed Smith
This seems like a good idea, but I'm currently using GWT and have only the String class regex methods. So I can split my string or replace my regexps, but I don't know how to get only the things I matched :(
janesconference
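If only the String methods are available, one possible workaround is to split on non-word characters and keep the long pieces. This is only a sketch in plain Java: the names are invented, and I have not checked it under GWT, so treat the availability of split and StringBuilder there as an assumption.

```java
class KeepLongWordsSplit {
    // Splits on runs of non-word characters, then keeps only the pieces
    // that are at least three characters long, separated by single spaces.
    static String keepLongWords(String text) {
        StringBuilder out = new StringBuilder();
        for (String word : text.split("\\W+")) {
            if (word.length() >= 3) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(word);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "hello there this short word"
        System.out.println(keepLongWords("hello there my this is a short word"));
    }
}
```

The length check also discards the empty piece that split can produce when the text starts with punctuation or whitespace.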
+5  A: 

I've got it working fairly well, but it took two passes.

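public class ShortWordFilter {
    public static void main(String[] args) {
        String passage = "Well, I'm looking for a regexp in Java that deletes all words shorter than 3 characters.";
        System.out.println(passage);

        // Pass 1: delete every run of one or two word characters or apostrophes.
        passage = passage.replaceAll("\\b[\\w']{1,2}\\b", "");
        // Pass 2: collapse the whitespace runs left behind into a single space.
        passage = passage.replaceAll("\\s{2,}", " ");

        System.out.println(passage);
    }
}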

The first pass deletes every word with fewer than three characters by replacing it with the empty string. Note that I had to include the apostrophe in the character class because the word "I'm" was giving me trouble without it. You may find other special characters in your text that you also need to include here.

The second pass is necessary because the first pass leaves double spaces wherever a word was removed. It collapses every run of two or more whitespace characters into a single space. It's up to you whether you need to keep this or not, but I think it's better with the spaces collapsed.

Output:

Well, I'm looking for a regexp in Java that deletes all words shorter than 3 characters.

Well, looking for regexp Java that deletes all words shorter than characters.
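
One caveat with the character class above: as far as I can tell, a word like "don't" comes out as "don", because the "'t" at the end matches on its own. A possible variant, sketched below, uses lookarounds instead of \b so that a word is treated as a whole run of word characters and apostrophes. That definition of a word is my assumption, the names are hypothetical, and the behaviour differs from the code above: "I'm" counts as three characters here and is kept.

```java
class ApostropheAwareFilter {
    static String removeShortWords(String text) {
        // Assumption: a "word" is a maximal run of word characters and
        // apostrophes. The lookarounds require that the one or two matched
        // characters are not part of a longer run.
        String stripped = text.replaceAll("(?<![\\w'])[\\w']{1,2}(?![\\w'])", "");
        // Collapse the whitespace left behind and trim the ends.
        return stripped.replaceAll("\\s{2,}", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(removeShortWords("I don't do it")); // prints "don't"
    }
}
```

Run on the passage from the question, this keeps "Well, I'm looking for regexp Java ..." rather than dropping "I'm", so pick whichever behaviour fits your text.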

Bill the Lizard
Yep, this works pretty well.
janesconference