I want to replace any occurrence of more than one space with a single space, but leave text between quotes untouched.

Is there any way of doing this with a Java regex? If so, can you please attempt it or give me a hint?

A: 

Regarding "text between quotes": are the quotes always on the same line, or can a quoted section span multiple lines?

anjanb
+2  A: 

When trying to match something that can be contained within something else, it can be helpful to construct a regular expression that matches both, like this:

("[^"\\]*(?:\\.[^"\\]*)*")|(  +)

This matches either a quoted string or a run of two or more spaces. Because the quoted-string alternative consumes the whole string, quotation marks included, the spaces alternative never gets a chance to match inside quotes (assuming the quotes are balanced). You then examine each match to determine whether it is a quoted string or a run of spaces, and act accordingly:

Pattern spaceOrStringRegex = Pattern.compile( "(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")|(  +)" );

StringBuffer replacementBuffer = new StringBuffer();

Matcher spaceOrStringMatcher = spaceOrStringRegex.matcher( text );

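while ( spaceOrStringMatcher.find() ) 
{
    // group 2 is non-null only when the match is a run of spaces;
    // quoted strings (group 1) are skipped here and copied through
    // unchanged by the next appendReplacement or by appendTail
    if ( spaceOrStringMatcher.group( 2 ) != null ) 
    {
        // replace the run with a single space
        spaceOrStringMatcher.appendReplacement( replacementBuffer, " " );
    }
}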

spaceOrStringMatcher.appendTail( replacementBuffer );
Jeff Hillman
A: 

Tokenize the input and emit a single space between tokens. A quick Google search for "java tokenizer that handles quotes" turned up: this link

YMMV

Edit: SO didn't like that link. Here's the Google search link: google. It was the first result.
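
For illustration, here is a minimal sketch of the same idea done as a hand-rolled character scan instead of a tokenizer library. The class and method names are hypothetical, and it assumes plain double quotes with no escaped quotation marks inside the quoted sections.

```java
final class QuoteAwareScanner {

    // Hypothetical helper: collapses runs of spaces outside double-quoted
    // sections. Assumes quotes are never escaped.
    static String collapseSpaces(String text) {
        StringBuilder out = new StringBuilder(text.length());
        boolean inQuotes = false;          // true while between a pair of quotes
        boolean previousWasSpace = false;  // true if the last kept char was an unquoted space

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;      // entering or leaving a quoted section
            }
            if (!inQuotes && c == ' ' && previousWasSpace) {
                continue;                  // drop the extra space outside quotes
            }
            out.append(c);
            previousWasSpace = !inQuotes && c == ' ';
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "blah    blah  \"boo   boo boo\"  blah  blah";
        // prints: blah blah "boo   boo boo" blah blah
        System.out.println(collapseSpaces(text));
    }
}
```

If the input has an unmatched quotation mark, everything after it is treated as quoted and left alone.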

Niniki
A: 

Personally, I don't use Java, but this RegExp could do the trick:

([^\" ])*(\\\".*?\\\")*

I tried the expression in RegexBuddy, which generates the following Java code. It looks fine to me:

try {
    Pattern regex = Pattern.compile("([^\" ])*(\\\".*?\\\")*", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    Matcher regexMatcher = regex.matcher(subjectString);
    while (regexMatcher.find()) {
        for (int i = 1; i <= regexMatcher.groupCount(); i++) {
            // matched text: regexMatcher.group(i)
            // match start: regexMatcher.start(i)
            // match end: regexMatcher.end(i)

            // I suppose here you must use something like
            // sstr += regexMatcher.group(i) + " "
        }
    }
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}

At least, it seems to work fine in Python:

import re

text = """
this  is   a text for   testing "to see  how it behaves  " the function   on this
"to see  how it behaves  " the function   on this  "or on another" or whatever
"""

ret = ""
print(text)

reobj = re.compile(r'([^\" ])*(\".*?\")*', re.IGNORECASE)

for match in reobj.finditer(text):
    # skip the empty matches this pattern produces between tokens
    if match.group() != "":
        ret = ret + match.group() + "|"

print(ret)
PabloG
A: 

After you parse out the quoted content, run this on the rest, in bulk or piece by piece as necessary:

String text = "ABC   DEF GHI   JKL";
text = text.replaceAll("( )+", " ");
// text: "ABC DEF GHI JKL"
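
One possible way to do the "parse out the quoted content" step, sketched under the assumption that quotes are balanced and never escaped: split on the quotation mark, so that even-numbered pieces lie outside quotes and odd-numbered pieces lie inside. The class and method names are made up for the example.

```java
final class SplitOnQuotes {

    // Hypothetical helper: splits on '"' and collapses spaces only in the
    // pieces that lie outside quotes (even indices).
    static String collapseOutsideQuotes(String text) {
        // limit -1 keeps trailing empty pieces, so a closing quote at the
        // very end of the input is not lost
        String[] parts = text.split("\"", -1);
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                out.append('"');           // restore the quote removed by split
            }
            boolean outsideQuotes = (i % 2 == 0);
            out.append(outsideQuotes ? parts[i].replaceAll("( )+", " ") : parts[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "blah    blah  \"boo   boo boo\"  blah  blah";
        // prints: blah blah "boo   boo boo" blah blah
        System.out.println(collapseOutsideQuotes(text));
    }
}
```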
Dov Wasserman
A: 

Jeff, you're on the right track, but there are a few errors in your code, to wit: (1) You forgot to escape the quotation marks inside the negated character classes; (2) The parens inside the first capturing group should have been of the non-capturing variety; (3) If the second set of capturing parens doesn't participate in a match, group(2) returns null, and you're not testing for that; and (4) If you test for two or more spaces in the regex instead of one or more, you don't need to check the length of the match later on. Here's the revised code:

import java.util.regex.*;

public class Test
{
  public static void main(String[] args) throws Exception
  {
    String text = "blah    blah  \"boo   boo boo\"  blah  blah";
    Pattern p = Pattern.compile( "(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")|(  +)" );
    StringBuffer sb = new StringBuffer();
    Matcher m = p.matcher( text );
    while ( m.find() ) 
    {
      if ( m.group( 2 ) != null ) 
      {
        m.appendReplacement( sb, " " );
      }
    }
    m.appendTail( sb );
    System.out.println( sb.toString() );
  }
}
Alan Moore
@Alan - Thanks. I updated my answer accordingly.
Jeff Hillman
Cool. I should have done this in the form of a comment, but with 300 characters and no formatting, that just wasn't possible.
Alan Moore
+1  A: 

Here's another approach that uses a lookahead to check that the quotation marks after the current position come in matched pairs. If an even number of quotation marks remain, the spaces being matched lie outside any quoted section.

text = text.replaceAll("  ++(?=(?:[^\"]*+\"[^\"]*+\")*+[^\"]*+$)", " ");

If needed, the lookahead can be adapted to handle escaped quotation marks inside the quoted sections.
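
As a rough sketch of one such adaptation (not necessarily what the author had in mind): replace each `[^\"]*+` with a sub-pattern that also steps over backslash escapes. The names `NO_QUOTE`, `SPACES_OUTSIDE_QUOTES` and `collapse` are made up for this example. It assumes a backslash escapes the next character everywhere, including outside quotes.

```java
import java.util.regex.Pattern;

final class EscapeAwareLookahead {

    // A run containing no unescaped quotation mark: plain characters,
    // or a backslash followed by any one character.
    private static final String NO_QUOTE = "[^\"\\\\]*+(?:\\\\.[^\"\\\\]*+)*+";

    // Two or more spaces, followed only by complete quoted sections
    // (escapes allowed inside them) up to the end of the input.
    private static final Pattern SPACES_OUTSIDE_QUOTES = Pattern.compile(
        "  ++(?=(?:" + NO_QUOTE + "\"" + NO_QUOTE + "\")*+" + NO_QUOTE + "$)");

    static String collapse(String text) {
        return SPACES_OUTSIDE_QUOTES.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        // The input is: a  b "c  \"  d"  e
        String text = "a  b \"c  \\\"  d\"  e";
        // prints: a b "c  \"  d" e
        System.out.println(collapse(text));
    }
}
```

With the simpler lookahead, the escaped quotation mark in this input would be counted as a real one, and the spaces inside the quoted section would be collapsed as well.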

Alan Moore