I want to match an optional tag at the end of a line of text.

Example input text:

The quick brown fox jumps over the lazy dog {tag}

I want to match the part in curly-braces and create a back-reference to it.

My regex looks like this:

^.*(\{\w+\})?

(This is somewhat simplified; I'm also matching parts before the tag.)

It matches the lines ok (with and without the tag) but doesn't create a back-reference to the tag.

If I remove the '?' character, so regex is:

^.*(\{\w+\})

It creates a back-reference to the tag but then doesn't match lines without the tag.
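The behaviour described above can be reproduced with a small sketch. It uses Python's `re` module purely as an assumption for illustration (the question does not name an engine), and the helper `tag` is a hypothetical name; the two patterns are the ones from the question.

```python
import re

lines = [
    "The quick brown fox jumps over the lazy dog {tag}",
    "The quick brown fox jumps over the lazy dog",
]

optional = re.compile(r"^.*(\{\w+\})?")  # tag group made optional
required = re.compile(r"^.*(\{\w+\})")   # tag group mandatory


def tag(pattern, line):
    """Return group 1 if the pattern matches, or the string 'no match'."""
    m = pattern.match(line)
    return m.group(1) if m else "no match"


for line in lines:
    # First column: optional group; second column: mandatory group.
    print(repr(tag(optional, line)), repr(tag(required, line)))
```

With the optional group, both lines match but group 1 is never set; with the mandatory group, the tag is captured but the untagged line fails to match.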

I understood from http://www.regular-expressions.info/refadv.html that the optional operator wouldn't affect the backreference:

Round brackets group the regex between them. They capture the text matched by the regex inside them that can be reused in a backreference, and they allow you to apply regex operators to the entire grouped regex.

but must've misunderstood something.

How do I make the tag part optional and create a back-reference when it exists?

+3  A: 

It is not a backreference problem; the problem is that the regular expression was already satisfied by the text that matched .*. It had no reason to go on and match the optional end tag. The simplest solution, if you're truly reading to the end of the line, is to append a $ (dollar sign) to force the regular expression to match the whole line.

edit

BTW, I didn't take your regex literally, since you said it matches other stuff, but just to be clear: .* will consume the whole line. You'd need something like [^{]* to prevent the tag from getting swallowed. I'm guessing that's not a problem for you.

David Gladfelter
`.*` *does* already match the whole line.
Gumbo
@Gumbo, @padriagf said in his question that the regular expression in question was more complicated, so it may or may not consume the tag. I tried to make it clear that he needs to check for that.
David Gladfelter
+2  A: 

In addition to what others have explained, you might want to make the .* "lazy":

^.*?(\{\w+\})?
Toby
"Not greedy" is another term for `*?`, which always tries to find the shortest possible match (as opposed to the longest possible match).
gnarf
This won't actually work, though; see my answer. The lazy quantifier will match nothing. You can [try it for yourself](http://www.rubular.com/r/hjknUHyFQ7).
Antal S-Z
+1  A: 

As David Gladfelter said, the actual problem is that when you make it optional, it doesn't match; however, his proposed fix won't work. Edit 1: You'll need to use what he put in his edit (which got written as I was writing this).

The problem is that quantifiers (*, +, ?, {n,m}) are greedy: they always match as much as they possibly can. Thus, when you write ^.*(\{\w+\})?, the .* will always match the whole line, because an empty match satisfies the optional group. Also note that although ? is greedy, the first greediness (of .*) takes precedence.

If you're only allowed to have curly brackets around that optional group, then you can solve your problem by saying so explicitly: ^[^\{]*(\{\w+\})?. This way, the first chunk will match everything up to the first curly bracket, and then (since ? is greedy) match the curly-bracketed word if it can.
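A minimal sketch of that difference, assuming Python's `re` module as a stand-in engine (the variable names `greedy` and `bounded` are made up for illustration):

```python
import re

line = "The quick brown fox jumps over the lazy dog {tag}"

greedy = re.match(r"^.*(\{\w+\})?", line)
bounded = re.match(r"^[^\{]*(\{\w+\})?", line)

# Greedy .* has already consumed the tag, so group 1 did not participate.
print(greedy.group(0) == line, greedy.group(1))

# [^\{]* stops before the first "{", leaving the tag for group 1.
print(bounded.group(1), bounded.start(1) == line.index("{"))
```

Note that the bounded version only works as long as no other curly bracket appears earlier in the line.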

Often, another way to solve this is to make the quantifiers lazy (or non-greedy, minimal, etc.) by appending a ?: *?, +?, ??, and {n,m}?. However, this won't help you here: instead, if you do ^.*?(\{\w+\})?, the lazy .*? will try to match zero characters, succeed, and then the optional group won't match. Still, though it won't work here, it's a useful tool in your toolbox. Edit 1: Also, note that these aren't available in all regex engines, although they are available in C#.
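To see the lazy quantifier matching nothing, here is a short sketch, again assuming Python's `re` module rather than any particular engine from the question:

```python
import re

line = "The quick brown fox jumps over the lazy dog {tag}"

# Without an end anchor, the lazy .*? is satisfied with zero characters,
# the optional group fails at "T" and is skipped, and the match ends there.
m = re.match(r"^.*?(\{\w+\})?", line)

print(repr(m.group(0)), m.group(1))
```

The overall match succeeds, but it is empty and group 1 is unset.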

Antal S-Z
+1  A: 

Thanks guys. I used a combination of the answers: the non-greedy modifier and the end-of-line anchor, which seems to do the trick. The regex is now:

^.*?(\{\w+\})?$

I didn't want to use [^{]* for the first part of the match, as non-tag curly brackets may appear here, but tags will always be at the end of the line.
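A quick check of this combined pattern, sketched with Python's `re` module (an assumption; the sample lines are borrowed from elsewhere in this thread, and `tags` is a name chosen for illustration):

```python
import re

pattern = re.compile(r"^.*?(\{\w+\})?$")

lines = [
    "The quick brown fox jumps over the lazy dog",
    "The quick brown fox jumps over the lazy dog {tag}",
    "The quick brown fox jumps over the {lazy} dog",
    "The quick brown fox jumps over the {lazy} {dog}",
]

# The $ anchor forces the lazy .*? to keep expanding until the optional
# group (if present) sits right at the end of the line.
tags = [pattern.match(line).group(1) for line in lines]

for line, found in zip(lines, tags):
    print(repr(found), "<-", line)
```

Every line matches; group 1 is set only when a curly-bracketed word ends the line, and mid-line curly brackets are absorbed by the lazy prefix.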

Thanks for the answers, they were all helpful.

A: 

If you're only interested in the tag, and don't care about the rest of the string, then you'd make your life much easier by just matching the tag with this regex (see it on rubular.com):

\{(\w+)\}$

That is, you're trying to match some {word} at the end of the string. If it's not there, then too bad, there's no match. There is no need for a ? modifier or a reluctant .* and all that stuff.

In C#, you may even want to use RegexOptions.RightToLeft, since you're trying to match a suffix anyway, so perhaps something like this:

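using System;
using System.Text.RegularExpressions;

// Sample lines: no tag, tag at the end, braces mid-line, braces mid-line plus a tag.
string[] lines = {
  "The quick brown fox jumps over the lazy dog",
  "The quick brown fox jumps over the lazy dog {tag}",
  "The quick brown fox jumps over the {lazy} dog",
  "The quick brown fox jumps over the {lazy} {dog}",
};

// Match a {word} only at the end of the string, scanning from the right.
Regex r = new Regex(@"\{(\w+)\}$", RegexOptions.RightToLeft);

foreach (string line in lines) {
  // Groups[1].Value is the empty string when there is no match.
  Console.WriteLine("[" + r.Match(line).Groups[1].Value + "]");
}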

This prints (as seen on ideone.com):

[]
[tag]
[]
[dog]
polygenelubricants