Hi all

I have the following string: "3/4Ton". I want to split it so that:

word[1] = 3/4 and word[2] = Ton.

Right now my code looks like this:

Pattern p = Pattern.compile("[A-Z]{1}[a-z]+");
Matcher m = p.matcher(line);
while(m.find()){
    System.out.println("The word --> "+m.group());
}

It does the job of splitting a string on capital letters, for example:

String = MachineryInput

word[1] = Machinery , word[2] = Input

The only problem is that it does not preserve numbers, abbreviations, or sequences of capital letters that are not meant to be separate words. Could someone help me fix my regular expression?
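To make the gap concrete, here is a minimal sketch that runs the same pattern over a few inputs. The class name `CapitalWordMatcher`, the helper `findWords`, and the "USAToday" input are made up for illustration; only the pattern itself comes from the code above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CapitalWordMatcher {
    // Same pattern as in the question: one capital letter, then one or more lowercase letters.
    private static final Pattern WORD = Pattern.compile("[A-Z]{1}[a-z]+");

    // Hypothetical helper: collects every match instead of printing it.
    static List<String> findWords(String line) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(line);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(findWords("MachineryInput")); // [Machinery, Input]
        System.out.println(findWords("3/4Ton"));         // [Ton]   ("3/4" is never matched)
        System.out.println(findWords("USAToday"));       // [Today] ("USA" is never matched)
    }
}
```

Because `find()` only reports text that matches the pattern, anything that is not a capital letter followed by lowercase letters is silently skipped rather than kept as its own word.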

Thanks in advance...

+2  A: 

Using regex would be nice here. I bet there's a way to do it too, although I'm not a swing-in-on-a-vine regex guy so I can't help you. However, there's something you can't avoid - something, somewhere needs to loop over your String eventually. You could do this "on your own" like so:

import java.util.ArrayList;
import java.util.List;

// Splits str before every uppercase character; each piece is trimmed.
static String[] splitOnCapitals(String str) {
    List<String> pieces = new ArrayList<String>();
    StringBuilder builder = new StringBuilder();
    for (int i = 0; i < str.length(); i++) {
        if (Character.isUpperCase(str.charAt(i))) {
            // An uppercase character starts a new piece: flush what we have so far.
            String line = builder.toString().trim();
            if (line.length() > 0) pieces.add(line);
            builder = new StringBuilder();
        }
        builder.append(str.charAt(i));
    }
    String last = builder.toString().trim(); // get the last little bit too
    if (last.length() > 0) pieces.add(last);
    return pieces.toArray(new String[0]);
}

I tested it with the following test driver:

public static void main(String[] args) {
    String test = "3/4 Ton truCk";
    String[] arr = splitOnCapitals(test);
    for(String s : arr) System.out.println(s);

    test = "Start with Capital";
    arr = splitOnCapitals(test);
    for(String s : arr) System.out.println(s);
}

And got the following output:

3/4
Ton tru
Ck
Start with
Capital
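Note that this loop starts a new piece at every capital letter, so a run of capitals such as "USA" would come out as "U", "S", "A". One possible variation, sketched below and not part of the tested code above, only splits when the previous character is not uppercase. The class name `CapitalRunSplitter` and the method `splitKeepingCapitalRuns` are hypothetical names chosen for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

class CapitalRunSplitter {
    // Starts a new word at an uppercase character only if the
    // character before it is not uppercase, so runs of capitals stay together.
    static List<String> splitKeepingCapitalRuns(String str) {
        List<String> words = new ArrayList<>();
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            boolean startsWord = Character.isUpperCase(c)
                    && i > 0
                    && !Character.isUpperCase(str.charAt(i - 1));
            if (startsWord) {
                String word = builder.toString().trim();
                if (!word.isEmpty()) words.add(word);
                builder.setLength(0); // reuse the builder for the next word
            }
            builder.append(c);
        }
        String last = builder.toString().trim();
        if (!last.isEmpty()) words.add(last);
        return words;
    }

    public static void main(String[] args) {
        System.out.println(splitKeepingCapitalRuns("3/4Ton"));         // [3/4, Ton]
        System.out.println(splitKeepingCapitalRuns("MachineryInput")); // [Machinery, Input]
        System.out.println(splitKeepingCapitalRuns("USAToday"));       // [USAToday]
    }
}
```

The trade-off is visible in the last line: the abbreviation is preserved, but a word that directly follows a run of capitals is not separated from it.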
glowcoder
Thank you for your help. It definitely gives me a sense of direction and has shown me a different approach.
rookie
+3  A: 

You can actually do this with regex alone, using lookahead and lookbehind (see "Special constructs" on this page: http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html).

/**
 * We'll use this pattern as divider to split the string into an array.
 * Usage: myString.split(DIVIDER_PATTERN);
 */
private static final String DIVIDER_PATTERN =

        "(?<=[^\\p{Upper}])(?=\\p{Upper})"
        // either a character that is not uppercase,
        // followed by an uppercase character

        + "|(?<=[\\p{Lower}])(?=\\d)"
        // or there is a lowercase character followed by a digit

        ;

@Test
public void testStringSplitting() {

    Assert.assertEquals(2, "3/4Word".split(DIVIDER_PATTERN).length);
    Assert.assertEquals(7, "ManyManyWordsInThisBigThing"
            .split(DIVIDER_PATTERN).length);
    Assert.assertEquals(7, "This123/4Mixed567ThingIsDifficult"
            .split(DIVIDER_PATTERN).length);
}

So what you can do is something like this:

for(String word:myString.split(DIVIDER_PATTERN)){
    System.out.println(word);
}
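For anyone who wants to try this without JUnit, here is a minimal standalone sketch. The class name `DividerPatternDemo`, the helper `splitWords`, and the "USAToday" input are assumptions added for illustration; the pattern is the one shown above.

```java
import java.util.Arrays;

class DividerPatternDemo {
    // Both alternatives are zero-width (lookbehind + lookahead),
    // so split() cuts between characters without consuming any of them.
    private static final String DIVIDER_PATTERN =
            "(?<=[^\\p{Upper}])(?=\\p{Upper})" // non-uppercase character, then uppercase
            + "|(?<=[\\p{Lower}])(?=\\d)";     // lowercase character, then digit

    // Hypothetical convenience wrapper around String.split.
    static String[] splitWords(String s) {
        return s.split(DIVIDER_PATTERN);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitWords("3/4Ton")));
        // [3/4, Ton]
        System.out.println(Arrays.toString(splitWords("USAToday")));
        // [USAToday]
        System.out.println(Arrays.toString(splitWords("This123/4Mixed567ThingIsDifficult")));
        // [This, 123/4, Mixed, 567, Thing, Is, Difficult]
    }
}
```

A run of capitals stays in one piece because the first alternative requires a non-uppercase character immediately before the split point. The flip side is that a word directly following such a run, as in "USAToday", is not split off.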

Sean

seanizer
I am getting a syntax error at this line: "|(?<=[\\p{Lower}])(?=\\d)". I'm not sure why. Can you help me, please?
rookie
You're right, there was a + missing (it got lost when I added the docs). I'll add it right away.
seanizer
Thank you very much for your help!
rookie