views:

86

answers:

3

As a personal learning exercise, I wrote this regex to split a unary string into parts whose length is increasing powers of two (see also on ideone.com):

    for (String s :
       new String(new char[500])
       .split("(?=(.*))(?<=(?<=(^.*))\\G.*)(?<=(?=\\2\\2.\\1)^.*)")
    ) {
       System.out.printf("%s ", s.length());
    }
    // prints "1 2 4 8 16 32 64 128 245 "

This uses a combination of capturing during lookarounds, nested lookarounds, matching on backreferences, and infinite length lookbehind (which isn't officially supported in Java but works anyway). Properties of sums of powers of two and the fact that the string has a unary alphabet is also taken advantage of.

This solution is both unreadable and has a horrible performance.

My question is, how would you "optimize" this regex?

  • Can you make the regex more readable (it's okay if it performs worse)
  • Can you make the regex performs better (it's okay if it's less readable)
+1  A: 

I really wouldn't. I'd throw the whole thing away and redo it as readable procedural code.

There are some things you really shouldn't do with regular expressions. This is one of them. I'm all for educating yourself but do you really think that's going to come in handy at some point?

Perhaps you'd be better off learning something that will actually be usable and maintainable :-)

paxdiablo
Should be a comment, but programming isn't all about making money, it's also about having fun. Learning is fun for me.
polygenelubricants
And what did you learn today? I learnt (or more correctly, re-enforced my belief) that REs are a good tool but they're not fit for everything.
paxdiablo
@paxdiablo: Hopefully I'll learn from answers, not comments. I'm hoping Alan Moore will come and save the day. Or perhaps I'll have my own breakthrough.
polygenelubricants
What *I've* learned so far is that I haven't had enough coffee yet. :D
Alan Moore
+1  A: 

I am not a Java guy, therefore my answers is based on the .NET Regex implementation. I used

"(?<=(^.*)\\G.*)(?<=\\G\\1.)"

based on the fact that \sum_{i=0}^{n} 2^n = 2^{n+1} - 1. It basically reads "Match every position, for which the part after the last match is one longer than the part before the last match."

Its about twice as fast as your original (on .NET, again), taking less than 2 seconds to split 10.000 characters, and I'd argue that its a little more readable. Well... less unreadable. =)

Cheers! Nice question! =)

Edit: Looking at your Regex again, it seems to me that you are using the same approach, but in a more complicated manner. I admit that I have not tried to read yours before trying to make my own solution, both because I like a challenge and because your regex is quite unreadable. =) Are these nested lookarounds neccessary because of the Java regex engine?

Jens
Yep, this is exactly the same matching/arithmetic logic that I have, but more concise because .NET officially supports infinite lookbehind. This doesn't compile in Java (http://ideone.com/wcDRQ). +1 for effort, though. Java also doesn't support backreferences in lookbehind (see: http://stackoverflow.com/questions/2734977/backreferences-in-lookbehind), but you can use it in a lookahead nested inside a lookbehind. See? NOW we're all learning!! =)
polygenelubricants
A: 

These are patterns that have worked for me in Java. I'll eventually revise everything into one comprehensive answer with FULL explanations. These are all Java string literal representations.

  • P000: "(?=(.*))(?<=(?<=(^.*))\\G.*)(?<=(?=\\2\\2.\\1)^.*)"
  • P001: "(?=(.*))(?<=(?<=(^.*))\\G.*)(?<=(?=\\2.\\1)\\G.*)"
    • Essentially the same as P000, but instead of ^\2\G\2.\1, we cut off the head to just \G\2.\1
    • 500 in 2.21s (ideone.com)
  • P002: "(?=(.*))(?<=(?<=(^.*))(?=\\2.\\1)\\G.*)"
    • Much slower than P000 but shorter
    • Essentially a "refactoring" of P001 now that both lookbehinds are anchored at \G
    • 400 in 3.05s (ideone.com)
polygenelubricants