views:

101

answers:

3

I have a question about a regex. Given this part of a regex:

(.[^\\.]+)

Does the part [^\.]+ mean "get everything until the first dot"? So with this text:

Hello my name is Martijn. I live in Holland.

I get 2 results: both sentences. But when I leave out the + sign, I get matches of two characters each: He, ll, o<space>, my, etc. Why is that?

+2  A: 

Because a dot outside a character class (i.e., not between []) means (almost) any character.

So .[^\\.] means: match (almost) any character, followed by one character that is neither a dot nor a backslash. (Inside a character class a dot does not need to be escaped to mean a literal dot, but a backslash does.)

In your example, that is H (any character) followed by e (neither a dot nor a backslash), and so on.

With a + (one or more characters that are neither a dot nor a backslash), the match instead continues through all such characters until it reaches a dot.
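
As a rough illustration of the escaping point above, here is a minimal sketch using Python's `re` module. Python is an assumption here (the question does not show which engine is used), and the sample string `a\b.c` is made up for the demonstration; other engines or string-literal rules may treat backslashes differently.

```python
import re

# Hypothetical sample text: five characters a, backslash, b, dot, c.
text = r"a\b.c"

# Inside a character class an unescaped dot is already a literal dot,
# so [^.] and [^\.] exclude only the dot.
no_dot = re.findall(r"[^.]+", text)
no_dot_escaped = re.findall(r"[^\.]+", text)

# A literal backslash in the class must be written as \\,
# so [^\\.] excludes both the backslash and the dot.
no_dot_or_backslash = re.findall(r"[^\\.]+", text)

print(no_dot)               # ['a\\b', 'c']
print(no_dot_escaped)       # ['a\\b', 'c']
print(no_dot_or_backslash)  # ['a', 'b', 'c']
```

The raw-string prefix `r"..."` keeps Python's own string escaping out of the picture, so the regex engine sees exactly the pattern that is written.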

Vinko Vrsalovic
Might be a faulty copy and paste, since the OP posted both `[^\\.]` and `[^\.]` in his question. But no, `[^\.]` doesn't match (almost) any character except a dot or a backslash. It matches (almost) any character except a dot (the backslash can be omitted in this case). If a backslash should be included, it has to be escaped: `[^\\.]`.
Bart Kiers
Backslash escaping depends heavily on the language/platform. So it can mean either, depending on how your regex engine (or the language feeding it) interprets strings and backslashes.
Vinko Vrsalovic
Okay, but in most PCRE implementations, you need to escape the '\', whether it's inside or outside a character class. And I can't imagine ASP (the OP flagged the question as such) would match a backslash using the class: **[\]**
Bart Kiers
Yes, that's correct, .NET's engine behaves as you say.
Vinko Vrsalovic
+1  A: 

The regex means: any one character, followed by one or more characters that are not a backslash or a period.

Aviral Dasgupta
+2  A: 

Your regex .[^\\.]+ means:

  1. Match any character
  2. Match any characters until you reach a backslash or a dot ".". Note that [^\\.] means "neither a backslash nor a dot", so neither of those two characters can match. Because of the "+" at the end, it keeps on matching characters until it finds a dot or a backslash. The + is a greedy quantifier: it matches as many characters as it can.

When you input (quotes not included) "Hello my name is Martijn. I live in Holland.", the matches are:

  1. Hello my name is Martijn
  2. . I live in Holland

Note that the dot is not included in the first match since it stops at n in Martijn and the second match starts with the dot.
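
The two matches listed above can be reproduced with a short sketch. It assumes Python's `re` module rather than the engine used in the question, so details such as how `.` treats newlines may differ elsewhere; the variable names are illustrative only.

```python
import re

text = "Hello my name is Martijn. I live in Holland."

# One arbitrary character, then one or more characters that are
# neither a backslash nor a dot (greedy).
pattern = re.compile(r".[^\\.]+")

# Collect each match together with its starting index.
matches = [(m.start(), m.group()) for m in pattern.finditer(text)]

for start, value in matches:
    print(start, repr(value))
# 0 'Hello my name is Martijn'
# 24 '. I live in Holland'
```

The final dot of the text is not part of any match: the leading `.` could consume it, but no character is left to satisfy `[^\\.]+`.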

When you remove the +, leaving (.[^\\.]), it just means:

  1. Match any character
  2. Match any one character except a dot or a backslash.
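
A small sketch of the version without the +, again assuming Python's `re` module as a stand-in for whatever engine the question uses. Each match is exactly two characters long, which is why the text is chopped into pairs.

```python
import re

text = "Hello my name is Martijn. I live in Holland."

# One arbitrary character followed by exactly one character
# that is neither a backslash nor a dot.
pairs = re.findall(r".[^\\.]", text)

print(pairs[:4])  # ['He', 'll', 'o ', 'my']

# Every match consists of exactly two characters.
print(all(len(p) == 2 for p in pairs))  # True
```

The first dot still appears in the results, as the leading character of the pair `'. '`, because the unescaped `.` at the start of the pattern matches it.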
John