views:

137

answers:

1

I'm trying to solve wordEnds from codingbat.com using regex.

Given a string and a non-empty word string, return a string made of each char just before and just after every appearance of the word in the string. Ignore cases where there is no char before or after the word, and a char may be included twice if it is between two words.

wordEnds("abcXY123XYijk", "XY") → "c13i"
wordEnds("XY123XY", "XY") → "13"
wordEnds("XY1XY", "XY") → "11"
wordEnds("XYXY", "XY") → "XY"

This is the simplest as I can make it with my current knowledge of regex:

public String wordEnds(String str, String word) {
  return str.replaceAll(
     ".*?(?=word)(?<=(.|^))word(?=(.|$))|.+"
       .replace("word", java.util.regex.Pattern.quote(word)),
     "$1$2"
  );
}

replace is used to place in the actual word string into the pattern for readability. Pattern.quote isn't necessary to pass their tests, but I think it's required for a proper regex-based solution.

The regex has two major parts:

  • If after matching as few characters as possible ".*?", word can still be found "(?=word)", then lookbehind to capture any character immediately preceding it "(?<=(.|^))", match "word", and lookforward to capture any character following it "(?=(.|$))".
    • The initial "if" test ensures that the atomic lookbehind captures only if there's a word
    • Using lookahead to capture the following character doesn't consume it, so it can be used as part of further matching
  • Otherwise match what's left "|.+"
    • Groups 1 and 2 would capture empty strings

I think this works in all cases, but it's obviously quite complex. I'm just wondering if others can suggest a simpler regex to do this.

Note: I'm not looking for a solution using indexOf and a loop. I want a regex-based replaceAll solution. I also need a working regex that passes all codingbat tests.


I managed to reduce the occurrence of word within the pattern to just one.

".+?(?<=(^|.)word)(?=(.?))|.+"

I'm still looking if it's possible to simplify this further, but I also have another question:

  • With this latest pattern, I simplified .|$ to just .? successfully, but if I similarly tried to simplify ^|. to .? it doesn't work. Why is that?
A: 

I am working in .Net's regex but I was able to change your pattern to:

.+?(?<=(\w?)word)(?=(\w?))|.+

with the positive results. You know its a word (alphanumeric) type character, why not give a valid hint to the parser of that fact; instead of any character its an optional alpha numeric character.

It may answer why you don't need to specify the anchors of ^ and $....for what exactly is $ is it \r or \n or other? (.Net has issues with $, and maybe you are not exactly capturing a Null of $, but the null of \r or \n which allowed you to change to .? for $)

HTH

OmegaMan