views:

563

answers:

8

I have a string which contains digits and letters. I wish to split the string into contiguous chunks of digits and contiguous chunks of letters.

Consider the String "34A312O5M444123A".

I would like to output: ["34", "A", "312", "O", "5", "M", "444123", "A"]

I have code which works and looks like:

List<String> digitsAsElements(String str){
  StringBuilder digitCollector = new StringBuilder();

  List<String> output = new ArrayList<String>();

  for (int i = 0; i < str.length(); i++){
    char cChar = str.charAt(i);

    if (Character.isDigit(cChar))
       digitCollector.append(cChar);
    else{
      output.add(digitCollector.toString());
      output.add(""+cChar);

      digitCollector = new StringBuilder();
    }         
  }

  return output;
}

I considered splitting str twice to get an array containing all the numbers chunks and an array containing the all letters chunks. Then merging the results. I shied away from this as it would harm readability.

I have intentionally avoided solving this with a regex pattern as I find regex patterns to be a major impediment to readability.

  • Debuggers don't handle them well.
  • They interrupt the flow of someone reading source code.
  • Overtime regex's grow organically and become monsters.
  • They are deeply non intuitive.

My questions are:

  • How could I improve the readability of the above code?
  • Is there a better way to do this? A Util class that solves this problem elegantly.
  • Where do you draw the line between using a regEx and coding something simpilar to what I've written above?
  • How do you increase the readability/maintainability of regExes?
+12  A: 

For this particular task I'd always use a regex instead of hand-writing something similar. The code you have given above is, at least to me, less readable than a simple regular expression (which would be (\d+|[^\d]+) in this case, as far as I can see).

You may want to avoid writing regular expressions that exceed a few lines. Those can be and usually are unreadable and hard to understand, but so is the code they can be replaced with! Parsers are almost never pretty and you're usually better off reading the original grammar than trying to make sense of the generated (or handwritten) parser. Same goes (imho) for regexes which are just a concise description of a regular grammar.

So, in general I'd say banning regexes in favor of code like you've given in your question sounds like a terribly stupid idea. And regular expressions are just a tool, nothing less, nothing more. If something else does a better job of text parsing (say, a real parser, some substring magic, etc.) then use it. But don't throw away possibilities just because you feel uncomfortable with them – others may have less problems coping with them and all people are able to learn.

EDIT: Updated regex after comment by mmyers.

Joey
+1, not all regex is evil or ugly.
yx
+1, certainly regEx's have their place! The problem is not what the initial regex looks like, but what the regex looks like after 10 people over 5 years have amended it with special cases. It would be neat if there was something with the elegance of regex, but with the self documenting nature (and debugability) of Java.
e5
The regex should be (\d+|[^\d]+), or else it will grab everything starting at the first non-digit. Shame on you for misleading the people who actually posted code. :P
Michael Myers
I'd say it proves me point about regex if someone hadn't found a bug in the java I posted above.
e5
Re your edit: That's not what I meant. The problem with (\d+|\w+) is that \w includes all of \d (it's [a-zA-Z0-9], at least in Java), so it matches the entire expression if it gets a chance.
Michael Myers
mmyers: I stand corrected, thanks. Ok, e5, I've found a non-intuitive aspect of regexes. At least when you're writing them only once every few months.
Joey
+2  A: 

I would use something like this (warning, untested code). For me this is a lot more readable than trying to avoid regexps. Regexps are a great tool when used in right place.

Commenting methods and providing examples of input and output values in comments also helps.

List<String> digitsAsElements(String str){
    Pattern p = Pattern.compile("(\\d+|\\w+)*");
    Matcher m = p.matcher(str);

    List<String> output = new ArrayList<String>();
    for(int i = 1; i <= m.groupCount(); i++) {
       output.add(m.group(i));
    }
    return output;
}
Juha Syrjälä
+1  A: 

Awww, someone beat me to code. I think the regex version is easier to read/maintain. Also, note the difference in output between the 2 implementations vs the expected output ...

Output:

digitsAsElements1("34A312O5MNI444123A") = [34, A, 312, O, 5, M, , N, , I, 444123, A]
digitsAsElements2("34A312O5MNI444123A") = [34, A, 312, O, 5, MNI, 444123, A]
Expected: [34, A, 312, O, 5, MN, 444123, A]

Compare:

DigitsAsElements.java:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DigitsAsElements {

    static List<String> digitsAsElements1(String str){
        StringBuilder digitCollector = new StringBuilder();

        List<String> output = new ArrayList<String>();

        for (int i = 0; i < str.length(); i++){
          char cChar = str.charAt(i);

          if (Character.isDigit(cChar))
             digitCollector.append(cChar);
          else{
            output.add(digitCollector.toString());
            output.add(""+cChar);

            digitCollector = new StringBuilder();
          }         
        }

        return output;
      }

    static List<String> digitsAsElements2(String str){
        // Match a consecutive series of digits or non-digits
        final Pattern pattern = Pattern.compile("(\\d+|\\D+)");
        final Matcher matcher = pattern.matcher(str);

        final List<String> output = new ArrayList<String>();
        while (matcher.find()) {
            output.add(matcher.group());
        }

        return output;
      }

    /**
     * @param args
     */
    public static void main(String[] args) {
        System.out.println("digitsAsElements(\"34A312O5MNI444123A\") = " +
                digitsAsElements1("34A312O5MNI444123A"));
        System.out.println("digitsAsElements2(\"34A312O5MNI444123A\") = " +
                digitsAsElements2("34A312O5MNI444123A"));
        System.out.println("Expected: [" +
                "34, A, 312, O, 5, MN, 444123, A"+"]");
    }

}
Bert F
Expected value should have MNI, not NM?
Juha Syrjälä
"Expected" is what the poster said they expected vs what their implementation (digitsAsElements1) and the regex version (digitsAsElements2) actually output.
Bert F
No fair - they editted the post :-)
Bert F
+7  A: 

For a utility class, check out java.util.Scanner. There are a number of options in there as to how you might go about solving your problem. I have a few comments on your questions.

Debuggers don't handle them (regular expressions) well

Whether a regex works or not depends on whats in your data. There are some nice plugins you can use to help you build a regex, like QuickREx for Eclipse, does a debugger actually help you write the right parser for your data?

They interrupt the flow of someone reading source code.

I guess it depends on how comfortable you are with them. Personally, I'd rather read a reasonable regex than 50 more lines of string parsing code, but maybe that's a personal thing.

Overtime regex's grow organically and become monsters.

I guess they might, but that's probably a problem with the code they live in becoming unfocussed. If the complexity of the source data is increasing, you probably need to keep an eye on whether you need a more expressive solution (maybe a parser generator like ANTLR)

They are deeply non intuitive.

They're a pattern matching language. I would say they're pretty intuitive in that context.

How could I improve the readability of the above code?

Not sure, apart from use a regex.

Is there a better way to do this? A Util class that solves this problem elegantly.

Mentioned above, java.util.Scanner.

Where do you draw the line between using a regEx and coding something simpilar to what I've written above?

Personally I use regex for anything reasonably simple.

How do you increase the readability/maintainability of regExes?

Think carefully before extending,take extra care to comment up the code and the regex in detail so that it's clear what you're doing.

Brabster
Wow it must've taken me a long time to write this! There was a load of answers in the meantime, sorry if I overlap.
Brabster
+1, Well done! This is exactly the answer I was looking for, I wish I could give you +10. =D
e5
Very happy to help. Good luck!
Brabster
"How do you increase the readability/maintainability of regExes?" - I will sometimes break up a long and/or complex regex along natural boundaries and give good names/comments to each piece. Since they are String literals, when the code concatenates the literals together, the compiler should do this at compile time, i.e. no performance cost for splitting it up into separate strings.
Bert F
"Overtime regex's grow organically and become monsters." This actually holds true for all code, in my experience. Caremust be taken when and how to extend existing code and that's no different for regular expressions.
Joey
It's a quantitative not qualitative difference. Compare 600 lines of java to 600 lines of regex. Compare how many mistakes people made in this post for a very basic regular expression to mistakes made in this post for slightly more complex java code. When is there going to be a "design patterns for regular expressions" book. =P
e5
@e5: http://www.amazon.com/Regular-Expressions-Cookbook-Jan-Goyvaerts/dp/0596520689/ - my copy arrived today. :)
Alan Moore
+1  A: 

you could use this class in order to simplify your loop:

public class StringIterator implements Iterator<Character> {

    private final char[] chars;
    private int i;

    private StringIterator(char[] chars) {
        this.chars = chars;
    }

    public boolean hasNext() {
        return i < chars.length;
    }

    public Character next() {
        return chars[i++];
    }

    public void remove() {
        throw new UnsupportedOperationException("Not supported.");
    }

    public static Iterable<Character> of(String string) {
        final char[] chars = string.toCharArray();

        return new Iterable<Character>() {

            @Override
            public Iterator<Character> iterator() {
                return new StringIterator(chars);
            }
        };
    }
}

Now you can rewrite this:

for (int i = 0; i < str.length(); i++){
    char cChar = str.charAt(i);
    ...
}

with:

for (Character cChar : StringIterator.of(str)) {
    ...
}

my 2 cents

BTW this class is also reusable in other context.

dfa
+1, StringIterator looks pretty neat.
e5
It doesn't scale, though. Every character has to be boxed for the Iterator<Character>, then unboxed for the foreach loop; that slaughters performance.
Alan Moore
you're right. I've fixed at least unboxing in the for loop
dfa
+1  A: 

I'm not overly crazy about regex myself, but this seems like a case where they will really simplify things. What you might want to do is put them into the smallest method you can devise, name it aptly, and then put all the control code in another method.

For instance, if you coded a "Grab block of numbers or letters" method, the caller would be a very simple, straight-forward loop just printing the results of each call, and the method you were calling would be well-defined so the intention of the regex would be clear even if you didn't know anything about the syntax, and the method would be bounded so people wouldn't be likely to muck it up over time.

The problem with this is that the regex tools are so simple and well-adapted to this use that it's hard to justify a method call for this.

Bill K
+1  A: 

Since no one seems to have posted correct code yet, I'll give it a shot.

First the non-regex version. Note that I use the StringBuilder for accumulating whichever type of character was seen last (digit or non-digit). If the state changes, I dump its contents into the list and start a new StringBuilder. This way consecutive non-digits are grouped just like consecutive digits are.

static List<String> digitsAsElements(String str) {
    StringBuilder collector = new StringBuilder();

    List<String> output = new ArrayList<String>();
    boolean lastWasDigit = false;
    for (int i = 0; i < str.length(); i++) {
        char cChar = str.charAt(i);

        boolean isDigit = Character.isDigit(cChar);
        if (isDigit != lastWasDigit) {
            if (collector.length() > 0) {
                output.add(collector.toString());
                collector = new StringBuilder();
            }
            lastWasDigit = isDigit;
        }
        collector.append(cChar);
    }
    if (collector.length() > 0)
        output.add(collector.toString());

    return output;
}

Now the regex version. This is basically the same code that was posted by Juha S., but the regex actually works.

private static final Pattern DIGIT_OR_NONDIGIT_STRING =
        Pattern.compile("(\\d+|[^\\d]+)");
static List<String> digitsAsElementsR(String str) {
    // Match a consecutive series of digits or non-digits
    final Matcher matcher = DIGIT_OR_NONDIGIT_STRING.matcher(str);
    final List<String> output = new ArrayList<String>();
    while (matcher.find()) {
        output.add(matcher.group());
    }
    return output;
}

One way I try to keep my regexes readable is their names. I think DIGIT_OR_NONDIGIT_STRING conveys pretty well what I (the programmer) think it does, and testing should make sure that it really does what it's meant to do.

public static void main(String[] args) {
    System.out.println(digitsAsElements( "34A312O5MNI444123A"));
    System.out.println(digitsAsElementsR("34A312O5MNI444123A"));
}

prints:

[34, A, 312, O, 5, MNI, 444123, A]
[34, A, 312, O, 5, MNI, 444123, A]
Michael Myers
+4  A: 

Would you be willing to use regexes if it meant solving the problem in one line of code?

// Split at any position that's either:
// preceded by a digit and followed by a non-digit, or
// preceded by a non-digit and followed by a digit.
String[] parts = str.split("(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)");

With the comment to explain the regex, I think that's more readable than any of the non-regex solutions (or any of the other regex solutions, for that matter).

Alan Moore
+1, that is truly slick! Well done sir.
e5