Hi People,

I need your help with a search task in Java. I need to read a line from a file and make a list of all the words that have more than one capital letter in them.

For example if the line is : There are SeVen Planets In this UniverSe

The result should be : SeVen and UniverSe

I am able to read the line and split it into words, but somehow I can't come up with the correct regular expression to search for these words.

The following is a small example I used but it returns false although I think it should return true.

System.out.println("ThiS".matches("[A-Z]{2,}"));

Can anyone please have a look at this and suggest ways to achieve my result? Appreciate any help.

Thanks,

AJ

+1  A: 

The regular expression you listed is not going to work because it will search for a contiguous sequence of 2 or more upper case letters.

I think what you need to do is to write an expression that lets you allow lowercase letters on both sides.

I don't remember the exact syntax (I'm going to check), but something like .*[A-Z].*[A-Z].* will ensure that the word has at least two upper case letters.
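
For what it's worth, here is a minimal sketch of how that expression could be applied to each word after splitting the line. The class and method names are made up for illustration, and splitting on whitespace is an assumption about how the words are separated.

```java
import java.util.ArrayList;
import java.util.List;

final class TwoCapitalsFilter {

    // Returns the words of the line that contain at least two upper-case ASCII letters.
    static List<String> wordsWithTwoCapitals(String line) {
        List<String> result = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            // matches() must consume the whole word, hence the leading and trailing ".*"
            if (word.matches(".*[A-Z].*[A-Z].*")) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("ThiS".matches("[A-Z]{2,}"));        // false: no two consecutive capitals
        System.out.println("ThiS".matches(".*[A-Z].*[A-Z].*")); // true
        // prints [SeVen, UniverSe]
        System.out.println(wordsWithTwoCapitals("There are SeVen Planets In this UniverSe"));
    }
}
```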

Uri
+7  A: 

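[A-Z]{2,} means 2 or more consecutive upper case letters. You could use [A-Z].*[A-Z], which allows any other characters to appear between the two upper case letters. Note that String.matches has to match the whole string, so with matches you would need .*[A-Z].*[A-Z].* to allow characters before the first and after the last upper case letter.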

Alternatively, you don't really need to use regex for this. If you prefer you could just iterate over each character in the string and use Character.isUpperCase and count the number of matching characters.
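
A rough sketch of the non-regex approach, assuming the line is split on whitespace. The class and method names are only illustrative.

```java
final class UpperCaseCounter {

    // Counts the upper-case characters in a single word.
    static int countUpperCase(String word) {
        int count = 0;
        for (int i = 0; i < word.length(); i++) {
            if (Character.isUpperCase(word.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    // True when the word contains two or more upper-case characters.
    static boolean hasMultipleUpperCase(String word) {
        return countUpperCase(word) >= 2;
    }

    public static void main(String[] args) {
        String line = "There are SeVen Planets In this UniverSe";
        // Prints SeVen and UniverSe, one per line.
        for (String word : line.split("\\s+")) {
            if (hasMultipleUpperCase(word)) {
                System.out.println(word);
            }
        }
    }
}
```

Unlike [A-Z], Character.isUpperCase also recognises upper case letters outside ASCII.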

mikej
+1 for "not every problem needs to be solved with a regex"
David Gelhar
I like the String iteration. It's simple and understandable.
extraneon
That would match `SeVen Planets In this UniverS`, with lazy it'd match `SeV`. You need to define word borders with `\b[A-Z].*?[A-Z].*?\b`, which is pretty inefficient. Maybe atomic grouping or look ahead would be better.
@fy-tide From the wording of the question I am assuming the OP has already split the string into individual words so we just need a method for getting true/false to indicate if a single word has >= 2 upper case letters.
mikej
@fy-tide: The word boundary solution wouldn't work either, as it would match `The Earth` just fine since there is a word boundary before "The" and one after "Earth", and `.` matches spaces.
Mark Peters
@Mark Peters True, \w boundaries would be better for the whole string like in the other answers.
+2  A: 

Maybe [a-z]*[A-Z][a-z]*[A-Z][a-z]* would work. The problem is that counting with {..} applies to consecutive characters, so it doesn't allow other characters between the two letters.

Jack
+1  A: 
\b(?:[a-z]*[A-Z]){2}[a-z]*\b

will match words that contain exactly two uppercase letters (use {2,} instead of {2} to match two or more).

If you want to allow words that contain letters outside ASCII, use

\b(?:\p{Ll}*\p{Lu}){2}\p{Ll}*\b

Of course, in a Java string, you need to escape (double) the backslashes.

So you get:

Pattern regex = Pattern.compile("\\b(?:\\p{Ll}*\\p{Lu}){2}\\p{Ll}*\\b");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
    // matched text: regexMatcher.group()
    // match start: regexMatcher.start()
    // match end: regexMatcher.end()
}
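
To try this out, the loop can be wrapped in a small class like the one below. This is only a sketch: the class name, the method name and the sample sentence (taken from the question) are illustrative, and the matches are collected into a list instead of being inspected in place.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TwoUpperCaseWords {

    // Same pattern as above, with the backslashes doubled for the Java string literal.
    private static final Pattern TWO_UPPER =
            Pattern.compile("\\b(?:\\p{Ll}*\\p{Lu}){2}\\p{Ll}*\\b");

    // Collects every match found in the subject string, in order of appearance.
    static List<String> findWords(String subjectString) {
        List<String> words = new ArrayList<>();
        Matcher regexMatcher = TWO_UPPER.matcher(subjectString);
        while (regexMatcher.find()) {
            words.add(regexMatcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [SeVen, UniverSe]
        System.out.println(findWords("There are SeVen Planets In this UniverSe"));
    }
}
```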
Tim Pietzcker
A: 

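Your current regular expression matches only a run of two or more consecutive upper case letters, not upper case letters spread throughout the word. So it would find a match inside THis and tHIS (using find(); String.matches requires the whole string to match), but not inside ThiS, as you have discovered.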

You need to look for an upper case letter, maybe some lower case, and then another upper. Or in regex: [A-Z]\w*?[A-Z]

If you want to search the whole string without needing to split it first, then include the possibility of other word characters on either end and let the expression capture: (\w*?[A-Z]\w*?[A-Z]\w*)

Also note that the first two quantifiers are reluctant, so they stop matching at the earliest opportunity, while the last one is the normal (greedy) quantifier, which picks up the rest of the word. Read more about the various quantifiers here.
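
As a quick illustration of the difference, here is a small sketch that runs the reluctant and greedy forms against a made-up word with three capitals. The class name, the helper method and the word SeVeN are assumptions for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReluctantVsGreedy {

    // Returns the first substring of input that the regex finds, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String word = "SeVeN"; // made-up word with three capitals

        // Reluctant: stops at the first capital that completes the match. Prints SeV
        System.out.println(firstMatch("[A-Z]\\w*?[A-Z]", word));

        // Greedy: runs on to the last capital in the word. Prints SeVeN
        System.out.println(firstMatch("[A-Z]\\w*[A-Z]", word));

        // Reluctant up to the second capital, then greedy for the rest. Prints SeVeN
        System.out.println(firstMatch("\\w*?[A-Z]\\w*?[A-Z]\\w*", word));
    }
}
```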

MikeD
A: 
    Pattern pat = Pattern.compile("\\w*[A-Z]\\w*[A-Z]\\w*");
    Matcher matcher = pat.matcher("There are SeVen Planets In this UniverSe");
    while ( matcher.find() ) {
        System.out.println(matcher.group());
    }

Prints

SeVen
UniverSe

I'm horrible with regex though so there's probably a simpler way. This way's really easy to understand though: start at the beginning of a word, match 0 or more characters, then an upper-case character, then 0 or more characters, then another upper-case character, then 0 or more characters.

Mark Peters
You don't need the word boundaries since word characters won't match the preceding non-word characters anyway.
MikeD
@MikeD, good point; when I put that in I wasn't initially using the word class.
Mark Peters
Hey Mark... must say you are great with regex! The sample code you gave seems to be working :) Thanks a lot for your inputs. MikeJ, Jack, Uri, Tim and MikeD... Thanks guys for your quick response. By the way, regex made me forget how many planets there are actually :p Appreciate all of your time. Thanks again! -AJS
AJS