views:

124

answers:

3

I'm trying to understand regex as much as I can, so I came up with this regex-based solution to codingbat.com repeatEnd:

Given a string and an int N, return a string made of N repetitions of the last N characters of the string. You may assume that N is between 0 and the length of the string, inclusive.

public String repeatEnd(String str, int N) {
  return str.replaceAll(
    ".(?!.{N})(?=.*(?<=(.{N})))|."
      .replace("N", Integer.toString(N)),
    "$1"
  );
}

Explanation on its parts:

  • .(?!.{N}): asserts that the matched character is one of the last N characters, by making sure that there aren't N characters following it.
  • (?=.*(?<=(.{N}))): in which case, use lookforward to first go all the way to the end of the string, then a nested lookbehind to capture the last N characters into \1. Note that this assertion will always be true.

  • |.: if the first assertion failed (i.e. there are at least N characters ahead) then match the character anyway; \1 would be empty.

  • In either case, a character is always matched; replace it with \1.

My questions are:

  • Is this technique of nested assertions valid? (i.e. looking behind during a lookahead?)
  • Is there a simpler regex-based solution?

Bonus question

Do repeatBegin (as analogously defined).

I'm honestly having troubles with this one!

+1  A: 

Whoa, that's some scary regex voodoo there! : )

  • Is this technique of nested assertions valid? (i.e. looking behind during a lookahead?)

Yes, that is perfectly valid in most PCRE implementations I know of.

  • Is there a simpler regex-based solution?

I didn't spend too much time on it, but I don't quickly see how that could be simplified or shortened with a single regex replacement.

Bart Kiers
Any reference on your answer for part 1?
polygenelubricants
No reference. I just know it works because I've tried and used it in the past. Why shouldn't it be legal? I wouldn't know where to find a reference that told be I could use a group inside another one, I just know I can.
Bart Kiers
Regarding your voodoo comment: to fully understand something, you must not fear it =)
polygenelubricants
Agreed, and it's nice gymnastics for the brain as well! :)
Bart Kiers
+2  A: 

Nice one! I don't see a way to significantly improve on that regex, although I would refactor it to avoid the needless use of negative logic:

".(?=.{N})|.(?=.*(?<=(.{N})))"

This way the second alternative is never entered until you reach the final N characters, which I think makes the intent a little clearer.

I've never seen a reference that says it's okay to nest lookarounds, but like Bart, I don't see why it wouldn't be. I sometimes use lookaheads inside lookbehinds to get around limitations on variable-length lookbehind expressions.


EDIT: I just realized I can simplify the regex quite a bit by putting the alternation inside the lookahead:

".(?=.{N}|.*(?<=(.{N})))"

By the way, have you considered using format() to build the regex instead of replace()?

return str.replaceAll(
  String.format(".(?=.{%1$d}|.*(?<=(.{%1$d})))", N),
  "$1"
);
Alan Moore
Good point! (15 chars...)
Bart Kiers
+1! Good job! Do check out my answer, though: it uses only 2 assertions, `?=` and `?<=`. Also check out my other regex/codingbat question (`wordEnds`). It just got me a Tumbleweed.
polygenelubricants
@Alan: I have used `format`, but I think I prefer `replace` because it makes the regex part more readable, and almost an independent entity; with `format`, you can't just pass around the regex without specifying the arguments to `format` since their order matters. You can also see multiple occurrences more easily (e.g. if there are multiple `N`, `M`). Using `replace` also puts the before->after i.e. var->value mapping side by side, which I prefer.
polygenelubricants
A: 

Is there a simpler regex-based solution?

It took me a while, but eventually I managed to simplify the regex to:

"(?=.{0,N}$(?<=(.{N}))).|." // repeatEnd
          -or-
".(?<=^(?=(.{N})).{0,N})|." // repeatBegin

Like Alan Moore's answer, this removes the negative assertion, but doesn't even replace it with a positive one, so it now only has 2 assertions instead of 3.

I also like the fact that the "else" case is just a simple .. I prefer to put the bulk of my regex into the "working" side of the alternation, and keep the "non-working" side as simple as possible (usually a simple . or .*).

polygenelubricants
When I mentioned needless use of negative logic, I wasn't referring to the negative lookahead *per se*, but to the contortions you went through to get that lone `.` all by itself in the second alternative. That may seem clearer to you, but to my eye it has the opposite effect.
Alan Moore
It's definitely a matter of personal preference. I'd always prefer `if(!a I'm honestly having a hard time with it. I have a solution that works, but it's quite different from `repeatEnd`.
polygenelubricants
`"(?<=.{N}|^(?=(.{N})).{0,N})."` seems to work. It's essentially the mirror image of my `repeatEnd` regex, except I had to use `{0,N}` instead of `*` to satisfy Java's "odious maximum length" requirement, plus `^` to force the lookbehind to go all the way back.
Alan Moore
By the way, are you aware of this feature? http://meta.stackoverflow.com/questions/43019/how-do-comment-replies-work I probably would have gotten back to you sooner if you had taken advantage of it.
Alan Moore