tags:

views:

246

answers:

7

Why does ".*" and ".+" give different results?

System.out.println("foo".replaceAll(".+", "bar")); // --> "bar"
System.out.println("foo".replaceAll(".*", "bar")); //--> "barbar"

I would expect "bar" for both, since * and + are both greedy and should match the whole String. (The above example is Java, but other Tools, like http://www.gskinner.com/RegExr/ give me the same result)

+10  A: 

You're right about both being greedy but ".*" is matching two strings: the first one is "foo" and the second is "". ".+" will only match "foo".

Both try to match the longest possible string which is "foo". After that, they try to find the longest matching string coming after the previous match. In this phase, ".*" is able to match an empty string while ".+" won't.

Mehrdad Afshari
+1, but I don't understand. Surely there are an infinite number of empty strings in "foo", why does it just pick one?
Skilldrick
that's obvious, the question OP has is why does it match two strings when both of the quantifiers are greedy.
SilentGhost
Mehrdad: I see, thanks.
Skilldrick
Mehrdad: And then why doesn't it try to match the longest possible string *again and again*, always returning the empty string?
Heinzi
@Heinzi: I guess because after the empty string at the end there is nothing - it's reached the end
Skilldrick
Heinzi: It understands that it has already passed the end of string. The point it, they way the Regex engine is implemented is that after matching `"foo"`, is doesn't count it as the end of string.
Mehrdad Afshari
I understand what you mean now, thanks for the explanation. (Still, I consider this strange.)
Heinzi
Heizi: It's not strange. It's the way it should work. If the terminating condition for the regex engine was to reach the end of string, `".*"` would not match `""` which is not what we want. The terminating condition is when it gets past the last character in the string.
Mehrdad Afshari
Heinzi: No, `".*"` is greedy. It won't match `""` when it can match `"foo"`.
Mehrdad Afshari
Mehrdad: Of course, you are right. +1 :-)
Heinzi
The regular expression engine does not revisit a previously visited position in a string. The maximum number of matches you could get from `foo` would be four. Imagine the following replacement: `.{0}` -> `!`; it will turn the string into `!f!o!o!` because it steps every position in the string *once*. This is not to be confused with backtracking which occurs during a step (sub-stepping in a step can visit any location independently of whether it has been visited before). That's why you can match the same string many times: `(?=(.+))*` will match everything after the current position always.
Blixt
Oops, by "match everything after the current position", I meant that it will be stored into the back-reference. The actual match will always be a zero-length string (the position only).
Blixt
A: 

hm, Python in both cases produces 'bar':

>>> import re
>>> re.sub('.+', 'bar', 'foo')
'bar'
>>> re.sub('.*', 'bar', 'foo')
'bar'
SilentGhost
why the downvote?
SilentGhost
+1 to counteract unexplained downvote
Heinzi
Why is this downvoted? Is it wrong? What dows python roduce?
Tim Büthe
I didn't downvote, but I won't upvote either. This is a comment on the question, rather than an answer, in my opinion.
Skilldrick
@Skilldrick: it shows that it's not some universal truth. It's limited to implementation. And I along with OP would consider such behaviour to be a bug.
SilentGhost
becuase while that is really and truly awesome for python, it offers no insight whatsoever to solving the question. (p.s. python's `sub` is explicitly designed not to match empty strings adjacent to non-empty matches).
Cirno de Bergerac
+1  A: 

My guess is that the greedy .* first matches the whole string and then starts looking for a match from the current position (end of string) and matches the empty string before quitting.

Amarghosh
`trace(String("foo").replace(/.*/g, "bar"));` in ActionScript gives `barbar` by the way - So apparently only python is awesome :)
Amarghosh
A: 

That's a really interesting question.

When you think about it, String.replaceAll(...) could logically have been implemented to do one of three things in the ".*" case:

  • do one replacement, giving "bar"
  • do two replacements giving "barbar"
  • try do an infinite number of replacements.

Clearly, the last alternative is not useful, so I can understand why they didn't do that. But we don't know why they chose "barbar" interpretation instead of the "bar" interpretation. The problem is that there is no universal standard for Regex syntax, yet alone Regex semantics. My guess is that the Sun author(s) did one of the following:

  • look at what other pre-existing implementations did and copied,
  • thought about it and did what they thought was best, or
  • didn't consider this edge case, and the current behavior is unintentional.

But at the end of the day, it doesn't really matter WHY they chose "barbar". The fact is that they did ... and we just need to deal with this.

Stephen C
I disagree. The exhibited behavior is very intentional, and well documented. It's perhaps clearer to understand what happens if stepped through a regular expression debugger. It will start at position 0 in the string. It will attempt to match `.` (any character) which matches `f`. Thanks to `*` it will continue matching. This makes it match `foo` and now it's at position 3 (after `o`). Since `replaceAll` keeps replacing after the previous match until there is no match, it will now try again. `.` fails to match, but since it's ok to match 0 times (`*`), it will still succeed. Thus two matches.
Blixt
@Blixt - I understand what it does. I just don't understand why they chose that interpretation when Python chose the other one.
Stephen C
+8  A: 

Mehrdad already explained that it also matches one empty substring at the end of the string. I found an official explanation of this behavior (why match one empty substring instead of an infinite number) in the .net documentation:

http://msdn.microsoft.com/en-us/library/c878ftxe.aspx

Quantifiers *, +, {n,m} (and their "lazy" counterparts) never repeat after an empty match when the minimum number n has been matched. This rule prevents quantifiers from entering infinite loops on empty matches when m is infinite (although the rule applies even if m is not infinite).

For example, (a?)* matches the string "aaa" and captures substrings in the pattern (a)(a)(a)(). Note that there is no fifth empty capture, because the fourth empty capture causes the quantifier to stop repeating.

Heinzi
+1 for digging up documentation
Tim Büthe
+2  A: 

Tested by experiment: replaceAll's matcher won't match twice in the same string position without advancing.

Experiment:

System.out.println("foo".replaceAll(".??", "[bar]"));

Output:

[bar]f[bar]o[bar]o[bar]

Explanation:

The pattern .?? is a non-greedy match of 0 or 1 characters, which means it will match nothing by preference, and one character if forced to. On the first iteration, it matches nothing, and the replaceAll replaces "" with "[bar]" in the beginning of the string. On the second iteration, it would match nothing again, but that's prohibited, so instead one character is copied from the input to the output ("f"), the position is advanced, the match is tried again, etc. so you have bar - f - bar - o - bar - o - bar: one "[bar]" for every distinct place where an empty string can be matched. At the end there's no possibility to advance so the replacement terminates, but only after matching the "final" empty string.

Just for curiosity's sake, Perl does something very similar, but it applies the rule differently, giving an output of "[bar][bar][bar][bar][bar][bar][bar]" for the same input and the same pattern -- .?? is still prohibited from making a zero-width match twice in a row in the same position, but it's allowed to backtrack and match a single character. Meaning it replaces "" with "[bar]", then replaces "f" with "[bar]", then "" with "[bar]" then "o" with "[bar]", etc. until at the end of the string the zero-width match is prohibited and there's no further positive-width match possible.

hobbs
I would not have expected the second output, but most likely this is an question of how `replaceAll` or equivalent is implemented. While the Java version matches empty string at position 0 and starts looking for next match at position 1, Perl most likely starts looking at position 0 again (but this time disallowing an empty match to avoid an infinite loop).
Blixt
Fascinating! In Java, `.??` (as a complete regex) seems to behave exactly like `|` or `\b|\B` or any other regex that matches the empty string at every character boundary. I would argue that Perl's approach is preferable, if only because there's no other way to achieve that result. Just give me a minute to figure out when that result would be useful... :/
Alan Moore
A: 

I think, the first round both patterns (.+ and .*) match all of string ("foo"). After that, remaining input that is empty string will be matched by .* pattern.

However, I found a quite strange result from the following patterns.

^.*  => 'bar'
.*$  => 'barbar'
^.*$ => 'bar'

Can you explain why it returns the above result? What's different between start string (^) and end string ($) in Regular Expression?

Update.1

I try to change input string to the following string.

foo

foo

Please look at new result!

'^.*' =>

bar

foo

'.*$' =>

foo

barbar

So, I think, there is only one beginning string for each input. In the other hand, when function find match string in input string, it does not remove ending string for current current string. PS. You can quickly try it at http://gskinner.com/RegExr/

Soul_Master
I tried that too and it actually makes sense: Both `^.*` and `^.*$` match *only* `foo` but not the empty substring at the end of the string (since there is "foo" in front of it instead of start-of-line. On the other hand, `.*$` *can* match "foo" (because it is followed by end-of-line) as well as the empty string at the end of the line (since it is also followed by end-of-line).
Heinzi
Nothing except for the fact that `.*` is greedy. At the beginning of the string, `^` matches, then `.*` matches, then you're not at the beginning of the string anymore so `^` can't possibly match again. but `.*$` is capable of matching "foo" when the position is the beginning of the string, and then matching "" when the position is the end of the string.
hobbs