ansaurus

Question

Answer 1

+1 A:

I really wouldn't. I'd throw the whole thing away and redo it as readable procedural code.

There are some things you really shouldn't do with regular expressions. This is one of them. I'm all for educating yourself but do you really think that's going to come in handy at some point?

Perhaps you'd be better off learning something that will actually be usable and maintainable :-)

paxdiablo 2010-07-07 09:31:18

Should be a comment, but programming isn't all about making money, it's also about having fun. Learning is fun for me.

polygenelubricants 2010-07-07 09:33:35

And what did you learn today? I learnt (or more correctly, re-enforced my belief) that REs are a good tool but they're not fit for everything.

paxdiablo 2010-07-07 09:35:42

@paxdiablo: Hopefully I'll learn from answers, not comments. I'm hoping Alan Moore will come and save the day. Or perhaps I'll have my own breakthrough.

polygenelubricants 2010-07-07 09:37:25

What *I've* learned so far is that I haven't had enough coffee yet. :D

Alan Moore 2010-07-07 15:06:29

Answer 2

+1 A:

I am not a Java guy, therefore my answers is based on the .NET Regex implementation. I used

"(?<=(^.*)\\G.*)(?<=\\G\\1.)"

based on the fact that \sum_{i=0}^{n} 2^n = 2^{n+1} - 1. It basically reads "Match every position, for which the part after the last match is one longer than the part before the last match."

Its about twice as fast as your original (on .NET, again), taking less than 2 seconds to split 10.000 characters, and I'd argue that its a little more readable. Well... less unreadable. =)

Cheers! Nice question! =)

Edit: Looking at your Regex again, it seems to me that you are using the same approach, but in a more complicated manner. I admit that I have not tried to read yours before trying to make my own solution, both because I like a challenge and because your regex is quite unreadable. =) Are these nested lookarounds neccessary because of the Java regex engine?

Jens 2010-07-07 10:25:18

Yep, this is exactly the same matching/arithmetic logic that I have, but more concise because .NET officially supports infinite lookbehind. This doesn't compile in Java (http://ideone.com/wcDRQ). +1 for effort, though. Java also doesn't support backreferences in lookbehind (see: http://stackoverflow.com/questions/2734977/backreferences-in-lookbehind), but you can use it in a lookahead nested inside a lookbehind. See? NOW we're all learning!! =)

polygenelubricants 2010-07-07 10:35:27

Answer 3

A:

These are patterns that have worked for me in Java. I'll eventually revise everything into one comprehensive answer with FULL explanations. These are all Java string literal representations.

P000: "(?=(.*))(?<=(?<=(^.*))\\G.*)(?<=(?=\\2\\2.\\1)^.*)"
- Original pattern
- 500 in 2.23s (ideone.com), 666 in 4.1s (ideone.com)
P001: "(?=(.*))(?<=(?<=(^.*))\\G.*)(?<=(?=\\2.\\1)\\G.*)"
- Essentially the same as P000, but instead of ^\2\G\2.\1, we cut off the head to just \G\2.\1
- 500 in 2.21s (ideone.com)
P002: "(?=(.*))(?<=(?<=(^.*))(?=\\2.\\1)\\G.*)"
- Much slower than P000 but shorter
- Essentially a "refactoring" of P001 now that both lookbehinds are anchored at \G
- 400 in 3.05s (ideone.com)

polygenelubricants 2010-07-07 11:27:25

ansaurus

tags:

views:

answers:

Regex optimization question

related questions