Hi All,

I am stuck on a problem with regular expressions. My requirement: split a long string into pieces of at most 125 characters and insert a line break between them. The split must not fall inside a word. In short, I want to split a string into smaller strings that are either 125 characters long or end at the last word boundary before the 125th character. I hope that isn't confusing.

I used a regexp to solve this, and believe me, I am an absolute zero at this. I just found some code and copy-pasted it ;-)

StringBuffer result = null;  
while(mailBody.trim().length() > 0){  
    Matcher m = Pattern.compile("^.{0,125}\\b").matcher(mailBody);  
    m.find();  
    String oneLineString = m.group(0);  
    if(result == null)  
        result = new StringBuffer(oneLineString);  
    else  
        result.append("\n"+ oneLineString);  
    mailBody = mailBody.substring(oneLineString.length(),
                                  mailBody.length()).trim();  
}

This is my code, and it works perfectly unless the starting string ends with a full stop (.). In that case it throws an error: No match found.

Please help.

Regards, Anoop P K

+1  A: 

Can you try using the following instead?

Matcher m = Pattern.compile("(?:^.{0,125}\\b)|(?:^.{0,125}$)").matcher(mailBody);

Here we use your original match OR we match a string whose total length is 125 characters or fewer. The (?:X) items are non-capturing groups, so that I can use the | operator on the large groups.

(See documentation for the Pattern class here.)
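To see what the second alternative buys you, here is a minimal sketch. The class and method names (AlternationDemo, firstChunk) are made up for illustration; only the pattern comes from the line above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AlternationDemo {
    // The pattern suggested above, compiled once.
    static final Pattern LINE =
        Pattern.compile("(?:^.{0,125}\\b)|(?:^.{0,125}$)");

    // Hypothetical helper: returns the first chunk the pattern takes
    // from text, or null if neither alternative matches.
    static String firstChunk(String text) {
        Matcher m = LINE.matcher(text);
        return m.find() ? m.group() : null;
    }
}
```

firstChunk("Done.") gives "Done" (the first alternative stops at the word boundary before the full stop), and firstChunk(".") gives "." through the second alternative, where the original pattern found no match at all.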


Addendum: @Anoop: Quite right, having sentence-ending punctuation left off on its own line is undesirable behavior. You can try this instead:

if(result == null)  
   result = new StringBuffer("");

mailBody = mailBody.trim();

while(mailBody.length() > 125) {

    // Try not to break immediately before closing punctuation
    Matcher m = Pattern.compile("^.{1,125}\\b(?![-\\.?;&)])").matcher(mailBody);
    String oneLineString;

    // Found a safe place to break string
    if (m.find()) {

        oneLineString = m.group(0);

    // Forced to break string in an ugly fashion
    } else {

        // Try to break at any word boundary at least
        m = Pattern.compile("^.{1,125}\\b").matcher(mailBody);

        if (m.find()) {

            oneLineString = m.group(0);

        // Last ditch scenario, just break at 125 characters
        } else {

            oneLineString = mailBody.substring(0,125);

        }

    }

    result.append(oneLineString + "\n");
    mailBody = mailBody.substring(oneLineString.length(),
                                  mailBody.length()).trim();  
}

result.append(mailBody);
Conspicuous Compiler
Hi buddy, perfect!! Now the error is gone. But another problem came up: the last full stop ends up on the next line. Not just a full stop; any special character, or series of such characters, at the end of the string ends up on a new line.
Anoop
Expanded the function to deal better with short strings and problematic strings. Added a negative lookahead assertion to the first match to avoid breaking at sentence-ending punctuation, plus a fallback function in case you just got a blob of text more than 125 characters long. Also, made the minimum match distance 1 instead of 0 to avoid problems with strings like 125 characters followed by a space.
Conspicuous Compiler
+1  A: 

Rather than using regexes directly, consider using a java.text.BreakIterator -- this is what it's designed for.
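A rough sketch of what that could look like, assuming the text is a single paragraph and that a piece longer than the limit is simply left on its own line. The class name, the wrap method and the maxWidth parameter are mine, not part of any API; only BreakIterator itself is.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class BreakIteratorWrap {
    // Hypothetical helper: wraps text so that each line is at most
    // maxWidth characters, breaking only where BreakIterator allows.
    static String wrap(String text, int maxWidth) {
        BreakIterator it = BreakIterator.getLineInstance(Locale.ENGLISH);
        it.setText(text);
        StringBuilder out = new StringBuilder(text.length() + 16);
        StringBuilder line = new StringBuilder();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                 start = end, end = it.next()) {
            // A piece is a word plus any whitespace that follows it.
            String piece = text.substring(start, end);
            // Start a new line if the piece (without its trailing
            // whitespace) no longer fits on the current one.
            if (line.length() > 0
                    && line.length() + piece.trim().length() > maxWidth) {
                out.append(line.toString().trim()).append('\n');
                line.setLength(0);
            }
            line.append(piece);
        }
        out.append(line.toString().trim());
        return out.toString();
    }
}
```

You would call it as BreakIteratorWrap.wrap(mailBody, 125). The line instance keeps closing punctuation attached to the preceding word, so a trailing full stop should not end up on a line of its own.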

Jim Garrison
Which will perform better, regexp or BreakIterator? I actually started off with BreakIterator, but I ran into some difficulty, so I switched to this.
Anoop
+1  A: 

First, you can technically get the same results with a simpler pattern and the lookingAt() method which makes your intent more obvious. Also, it's good to pull the pattern compilation out of the loop.

I think your regex is nice and simple though you might want to explicitly define what you mean by a word break rather than relying on what word boundary means. It sounds like you want to capture the period and break after but the \b won't do that. You can instead break on whitespace...

Edit: Even simpler now...

StringBuilder result = null;  
Pattern pattern = Pattern.compile( ".{0,125}\\s|.{0,125}" );
Matcher m = pattern.matcher(mailBody);
while( m.find() ) {
    String s = m.group(0).trim();
    if( result == null ) {
        result = new StringBuilder(s);  
    } else {
        result.append(s);
    }
}

...I think the new improved edits are even simpler and still do what you want.

The pattern can be adjusted if there are other characters that would be considered breakable characters:

Pattern.compile( ".{0,125}[\\s+&]|.{0,125}" );

...and so on. That would allow breaks on whitespace, + chars, and & chars as an example.

PSpeed
+2  A: 

I cannot comment yet; the answers given are good. I would add that you should initialize your StringBuffer before the loop and, to reduce copying, give it an initial capacity at least as large as your original string, like so:

StringBuffer result = new StringBuffer(mailBody.length());

Then in the loop there would be no need to check for result == null.

Edit: Comment on PSpeed's answer... It needs to add a newline before each appended line to match the original, something like this (assuming result is already initialized as I suggest):

while (m.find()) {
    if (result.length() > 0)
        result.append("\n");
    result.append(m.group().trim());
}
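
Put together as something you can run, it might look like the sketch below. It uses PSpeed's pattern with the 125 turned into a width parameter (my assumption, so it is easy to try with small inputs), a StringBuilder presized as suggested above, and the newline handling from the loop above. The class and method names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WhitespaceWrap {
    // Hypothetical helper combining the pattern and the loop above.
    static String wrap(String mailBody, int width) {
        // Same shape as ".{0,125}\\s|.{0,125}", with a variable limit.
        Pattern pattern = Pattern.compile(
            ".{0," + width + "}\\s|.{0," + width + "}");
        Matcher m = pattern.matcher(mailBody);
        StringBuilder result = new StringBuilder(mailBody.length());
        while (m.find()) {
            String line = m.group().trim();
            // find() also reports an empty match at the very end of
            // the input; skip it so no trailing newline is added.
            if (line.isEmpty()) {
                continue;
            }
            if (result.length() > 0) {
                result.append('\n');
            }
            result.append(line);
        }
        return result.toString();
    }
}
```

WhitespaceWrap.wrap("aaa bbb ccc", 7) returns "aaa bbb\nccc". One thing to be aware of: the first alternative wins whenever there is whitespace inside the window, so wrap("aaa bbb", 7) returns "aaa\nbbb" even though the whole string would fit on one line. It also assumes the input has no line breaks of its own, since . does not match them by default.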
Kevin Brock
A: 

Regexr

Moshe
A: 

The exception isn't being caused by your regex, it's because you're using the API incorrectly. You're supposed to check the return value of the find() method before you call group() -- that's how you know the match succeeded.

EDIT: Here's what's happening: when you get to the final chunk of text, the regex originally matches all the way to the end. But \b can't match at that position because the last character is a period (or full stop), not a word character. So it backtracks one position, and then \b can match between the final letter and the period.

Then it tries to match another chunk because mailBody.trim().length() is still greater than zero. But this time there are no word characters at all, so the match attempt fails and m.find() returns false. But you don't check the return value, you just go ahead and call m.group(0), which correctly throws an exception. You should be using m.find() as the while condition, not that business with the string length.
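
To make that concrete, here is a minimal sketch of your loop with the return value checked. The regex is yours; the class name, the chunks method and the decision to keep an unmatched remainder as its own chunk are assumptions on my part.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FindCheckDemo {
    // The regex from the question, compiled once outside the loop.
    static final Pattern LINE = Pattern.compile("^.{0,125}\\b");

    // Hypothetical helper: splits mailBody into chunks the way the
    // question's loop does, but never calls group() after a failed find().
    static List<String> chunks(String mailBody) {
        List<String> out = new ArrayList<String>();
        String rest = mailBody.trim();
        while (rest.length() > 0) {
            Matcher m = LINE.matcher(rest);
            // find() is false when no word boundary is left (e.g. "."),
            // and the match is empty when a single word is longer than
            // 125 characters; either way, keep the remainder and stop.
            if (!m.find() || m.end() == 0) {
                out.add(rest);
                break;
            }
            out.add(m.group());
            rest = rest.substring(m.end()).trim();
        }
        return out;
    }
}
```

For "Hello world." this yields the chunks "Hello world" and ".": no exception, but the full stop is still stranded, which is the problem the regex below tries to address.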

In fact, you're doing a lot more work than you need to; if you use the API correctly you can reduce your code to one line:

mailBody = mailBody.replaceAll(
    "\\G(\\w{125}|.{1,123}(?<=\\w\\b)[.,!?;:/\"-]*)\\s*",
    "$1\n" ).trim();

The regex isn't perfect--I don't think that's possible--but it might do well enough.
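
If you want to try it quickly, a small wrapper like this is enough. The class and method names are made up; the regex and the replacement string are copied unchanged from above.

```java
final class ReplaceAllWrap {
    // Hypothetical helper around the one-liner above.
    static String wrap(String mailBody) {
        return mailBody.replaceAll(
            "\\G(\\w{125}|.{1,123}(?<=\\w\\b)[.,!?;:/\"-]*)\\s*",
            "$1\n").trim();
    }
}
```

ReplaceAllWrap.wrap("Hello world.") returns the string unchanged, with the full stop still attached to the last word. One limitation to keep in mind: because of the \G anchor, if the regex fails to match at some position, nothing after that point is wrapped.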

Alan Moore
He's using it correctly. (Try to run the code yourself and see.) Group 0 is the entire matched pattern, a common behavior for regex libraries. (Even Perl has $& for the whole matched pattern.)
Conspicuous Compiler
I didn't say he shouldn't use `group(0)` or `group()`, I said that he should check the return value of `m.find()` first. See my edit for details.
Alan Moore