It seems simple, but I can't get it to work.

I have a string which looks like 'NNDDDDDAAAA', where 'N' is a non-digit, 'D' is a digit, and 'A' is anything. I need to replace each A with a space character. The number of 'N's, 'D's, and 'A's varies from one input string to the next.

I know how to do it with two expressions. I can split the string into two, and then replace everything in the second group with spaces. Like this:

    Pattern pattern = Pattern.compile("(\\D+\\d+)(.+)");
    Matcher matcher = pattern.matcher(input);
    if (matcher.matches()) {
        return matcher.group(1) + matcher.group(2).replaceAll(".", " ");
    }

But I was wondering whether it is possible with a single regular expression.

+1  A: 

What do you mean by non-digit vs. anything?

[^a-zA-Z0-9]
matches everything that is not a letter or digit.

You would want to replace anything that gets matched by the above regex with a space.

Is this what you were talking about?

Robert Greiner
Don't you mean /[^a-zA-Z0-9]/ /g ?
BryanH
That would delete the "anything" matches. I just wanted to throw up the regex that actually matches "anything". I will take the slashes out to clear things up. Thanks.
Robert Greiner
'anything' means anything, i.e. letters, digits, whitespace. I want to replace each occurrence with a space. For instance, 'AA12345d4 %' would be replaced with 'AA12345    ' (four spaces at the end)
vvs
+1  A: 

You want to use a positive look-behind to match the N's and D's, then use a normal match for the A's.

I'm not sure of the positive look-behind syntax in Java, but there are articles covering Java regex with look-behind.

Simeon Pilgrim
I was just about to post that ... honest! I don't know if you are allowed to have a variable-length look-behind pattern though, e.g. (?<=\D+)
Amal Sirisena
Not sure about the Java regex. I've read some articles on the look-ahead/look-behind restrictions in the three major variants of regex engines, and my main takeaway was that the .NET regex could do the good stuff, but sometimes just because it can doesn't mean you should.
Simeon Pilgrim
Here's a nice description of various engines' support for look behind: http://www.regular-expressions.info/lookaround.html#limitbehind
laz
No, in general they do not allow variable-width look behind. "(?<=\D+)" is allowed because it is equivalent to the fixed-width look behind "(?<=\D)"
newacct
And in any case, even if a look behind worked, it would not solve the OP's problem, which is to replace every character in the matched group with a space. There is no replacement string that will allow you to perform "replace this with a string of spaces of the same length".
newacct
@newacct: good point.
Simeon Pilgrim
+3  A: 

Given your description, I'm assuming that after the NNDDDDD portion, the first A will actually be an N rather than an A, since otherwise there's no solid boundary between the DDDDD and AAAA portions.

So, your string actually looks like NNDDDDDNAAA, and you want to replace the NAAA portion with spaces. Given this, the regex can be rewritten as such: (\\D+\\d+)(\\D.+)

Positive lookbehind in Java requires a pattern with a bounded maximum length, so you can't use the unbounded + or * quantifiers. You can instead use curly braces and specify a maximum length. For instance, you can use {1,9} in place of each +, and it will match between 1 and 9 characters: (?<=\\D{1,9}\\d{1,9})(\\D.+)

The only problem here is you're matching the NAAA sequence as a single match, so using "NNNDDDDNAAA".replaceAll("(?<=\\D{1,9}\\d{1,9})(\\D.+)", " ") will result in replacing the entire NAAA sequence with a single space, rather than multiple spaces.
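
As a minimal sketch of that behaviour (the class name `LookbehindDemo`, the method name `collapse`, and the sample input taken from the asker's comment are only illustrative), the whole tail comes back as one space:

```java
class LookbehindDemo {
    // {1,9} keeps the look-behind bounded; the tail (\D.+) is consumed as ONE match.
    static String collapse(String input) {
        return input.replaceAll("(?<=\\D{1,9}\\d{1,9})(\\D.+)", " ");
    }

    public static void main(String[] args) {
        // The four-character tail "d4 %" is replaced by a single space.
        System.out.println("'" + collapse("AA12345d4 %") + "'"); // prints 'AA12345 '
    }
}
```

The replacement string is fixed text, so it cannot grow with the length of what was matched.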

You could take the beginning delimiter of the match, and the string length, and use that to append the correct number of spaces, but I don't see the point. I think you're better off with your original solution; it's simple and easy to follow.
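
For what it's worth, a rough sketch of that idea might look like the following. The names `PadTail` and `blankTail` are made up for illustration, and returning non-matching input unchanged is my assumption, not something the question specifies:

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PadTail {
    private static final Pattern P = Pattern.compile("(\\D+\\d+)(\\D.+)");

    static String blankTail(String input) {
        Matcher m = P.matcher(input);
        if (!m.matches()) {
            return input; // assumption: leave non-matching input as it is
        }
        // Group 2 is the tail; its start/end offsets give the number of spaces needed.
        char[] spaces = new char[m.end(2) - m.start(2)];
        Arrays.fill(spaces, ' ');
        return input.substring(0, m.start(2)) + new String(spaces);
    }

    public static void main(String[] args) {
        System.out.println("'" + blankTail("AA12345d4 %") + "'"); // prints 'AA12345    '
    }
}
```

It works, but it is no shorter or clearer than the two-step version in the question.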

If you're looking for a little extra speed, you could compile your Pattern outside the function, and use StringBuilder or StringBuffer to create your output. If you're building a large String out of all these NNDDDDDAAAAA elements, work entirely in StringBuilder until you're done appending.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Test {

    // Compiled once and reused across calls.
    public static final Pattern p = Pattern.compile("(\\D+\\d+)(\\D.+)");

    public static StringBuffer replace( String input ) {
        StringBuffer output = new StringBuffer();
        Matcher m = Test.p.matcher(input);
        if( m.matches() )
            // Keep the NNDDDDD prefix; blank out every character of the tail.
            output.append( m.group(1) ).append( m.group(2).replaceAll("."," ") );
        else
            // No match: pass the input through rather than returning an empty buffer.
            output.append( input );

        return output;
    }

    public static void main( String[] args ) {
        String input = args[0];
        long startTime;

        StringBuffer tests = new StringBuffer();
        startTime = System.currentTimeMillis();
        for( int i = 0; i < 50; i++ )
        {
            tests.append( "Input -> Output: '" );
            tests.append( input );
            tests.append( "' -> '" );
            tests.append( Test.replace( input ) );
            tests.append( "'\n" );
        }
        System.out.println( tests.toString() );
        // Elapsed time in milliseconds.
        System.out.println( "\n" + (System.currentTimeMillis()-startTime) );
    }

}

Update: I wrote a quick iterative solution, and ran some random data through both. The iterative solution is around 4-5x faster.

public static StringBuffer replace( String input )
{
    StringBuffer output = new StringBuffer();
    // second: the digit run (D) has started; third: the tail (A) has started.
    boolean second = false, third = false;
    for( int i = 0; i < input.length(); i++ )
    {
        char c = input.charAt(i);

        if( !second && Character.isDigit(c) )
            second = true;

        // Any non-digit (not only a letter) after the digits starts the tail,
        // matching the \D in the regex version.
        if( second && !third && !Character.isDigit(c) )
            third = true;

        if( third )
            output.append( ' ' );
        else
            output.append( c );
    }

    return output;
}
Curtis Tasker
+1  A: 

I know you asked for a regex, but why do you even need a regex for this? How about:

StringBuilder sb = new StringBuilder(inputString);
for (int i = sb.length() - 1; i >= 0; i--) {
    if (Character.isDigit(sb.charAt(i)))
        break;
    sb.setCharAt(i, ' ');
}
String output = sb.toString();

You might find this post interesting. Of course, the above code assumes there will be at least one digit in the string - all characters following the last digit are converted to spaces. If there are no digits, every character is converted to a space.
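
To make those edge cases concrete, here is the same loop wrapped in a method. The class name `TrailingBlank`, the method name `blankAfterLastDigit`, and the sample inputs are just for illustration:

```java
class TrailingBlank {
    // Walks backwards from the end, blanking characters until a digit is hit.
    static String blankAfterLastDigit(String inputString) {
        StringBuilder sb = new StringBuilder(inputString);
        for (int i = sb.length() - 1; i >= 0; i--) {
            if (Character.isDigit(sb.charAt(i)))
                break;
            sb.setCharAt(i, ' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("'" + blankAfterLastDigit("AB123xyz") + "'");    // prints 'AB123   '
        System.out.println("'" + blankAfterLastDigit("abc") + "'");         // prints '   '
        // A digit inside the tail stops the scan early:
        System.out.println("'" + blankAfterLastDigit("AA12345d4 %") + "'"); // prints 'AA12345d4  '
    }
}
```

The last call shows the limit of scanning from the end: only the characters after the final digit are blanked.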

Vinay Sajip
I think you are right. I was refactoring some old code which has multiple loops and indexOf()/substring() and I thought it could be done with a simple regex. Didn't even think about cleaning up the old logic. I think your approach would be the most efficient for this task. Thanks for thinking outside the box, i.e. my initial requirements.
vvs
Your code assumes that the AAA portion will be non-digits. This is contrary to the problem description, which says that A will be 'anything', which could include digits.
Curtis Tasker
Well then, the solution can be slightly adapted to locate the point where a digit is followed by a non-digit. It still ends up being simpler than using regexes where they're not really necessary.
Vinay Sajip
Yes, I had to add additional logic to find the point where digits are allowed. Still pretty simple.
vvs
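
One way the adaptation discussed in these comments could look, scanning forward to the first place where a digit is followed by a non-digit. This is only a sketch under assumptions: the names `ForwardBlank` and `blankTail` are hypothetical, and it is not the code vvs actually ended up with:

```java
class ForwardBlank {
    static String blankTail(String input) {
        int n = input.length();
        int i = 0;
        // Skip the leading non-digits (the N part).
        while (i < n && !Character.isDigit(input.charAt(i))) i++;
        // Skip the digit run (the D part).
        while (i < n && Character.isDigit(input.charAt(i))) i++;
        // Everything from index i onward is the "anything" part; blank it out.
        StringBuilder sb = new StringBuilder(input);
        for (int j = i; j < n; j++) {
            sb.setCharAt(j, ' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("'" + blankTail("AA12345d4 %") + "'"); // prints 'AA12345    '
    }
}
```

Input with no digits at all comes back unchanged, since the first loop runs to the end of the string.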