I'm being lazy tonight and don't want to figure this one out. I need a regex to match 'jeremy.miller' and 'scottgu' from the following inputs:

http://codebetter.com/blogs/jeremy.miller/archive/2009/08/26/talking-about-storyteller-and-executable-requirements-on-elegant-code.aspx

http://weblogs.asp.net/scottgu/archive/2009/08/25/clean-web-config-files-vs-2010-and-net-4-0-series.aspx

Ideas?

Edit

Chris Lutz did a great job of meeting the requirements above. What if these were the inputs so you couldn't use 'archive' in the regex?

 http://codebetter.com/blogs/jeremy.miller/
 http://weblogs.asp.net/scottgu/
+5  A: 

Try this one:

/\/([\w\.]+)\/archive/
RaYell
Damn, beat me by just a few seconds. +1
Chris Lutz
When tested here http://www.regexlib.com/RETester.aspx this one didn't work.
John Sheehan
It does work. You just need to remove the first and last `/` if you are using that tool. I'm using PERL notation here to mark the beginning and end of the regular expression.
RaYell
RaYell - What's PERL? I know Perl is a language, and `perl` is the interpreter for that language, but I'm not familiar with PERL.
Chris Lutz
+7  A: 

Would this be what you're looking for?

'/([^/]+)/archive/'

This captures the piece before "archive" in both cases. Depending on your regex flavor, you may need to escape the /s for it to work. As an alternative, if you don't want to match the archive part, you could use a lookahead. I don't like lookaheads, though, and in my opinion it's easier to match a lot and capture only the parts you need, so if you'd rather use a lookahead to verify that the next part is archive, you can write one yourself.
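A minimal Python sketch of both approaches, since the question doesn't name a language. The function names are made up for illustration, and Python's `re` module needs no escaping for `/`:

```python
import re

urls = [
    "http://codebetter.com/blogs/jeremy.miller/archive/2009/08/26/talking-about-storyteller-and-executable-requirements-on-elegant-code.aspx",
    "http://weblogs.asp.net/scottgu/archive/2009/08/25/clean-web-config-files-vs-2010-and-net-4-0-series.aspx",
]

# Capturing version: match "/<segment>/archive/" and keep only the segment.
capture = re.compile(r"/([^/]+)/archive/")

# Lookahead version: match only the segment and assert that "/archive/" follows.
lookahead = re.compile(r"[^/]+(?=/archive/)")


def blog_name(url):
    """Return the path segment before /archive/, or None if there is none."""
    m = capture.search(url)
    return m.group(1) if m else None


def blog_name_lookahead(url):
    """Same result, but the whole match is the segment itself."""
    m = lookahead.search(url)
    return m.group(0) if m else None


for url in urls:
    print(blog_name(url), blog_name_lookahead(url))
```

Both functions return the same segment; the only difference is whether "/archive/" is part of the match or only asserted.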

EDIT: As you update your question, my idea of what you want is becoming fuzzier. If you want a new regex to match the second set of inputs, you can just pluck the appropriate part off the end, with the same / caveat as before:

'/([^/]+)/$'
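As a rough Python illustration of the end-anchored pattern (`last_segment` is a hypothetical helper name, and the `strip()` call is only there to drop the leading spaces from the pasted inputs):

```python
import re

# Match the final "/<segment>/" at the very end of the URL.
trailing_segment = re.compile(r"/([^/]+)/$")


def last_segment(url):
    """Return the last path segment of a URL that ends in '/', else None."""
    m = trailing_segment.search(url.strip())
    return m.group(1) if m else None


print(last_segment(" http://codebetter.com/blogs/jeremy.miller/"))
print(last_segment(" http://weblogs.asp.net/scottgu/"))
```

This only works for URLs that end in a slash; the longer archive URLs from the original question end in `.aspx`, so they don't match.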

If you specifically want either the text jeremy.miller or scottgu, regardless of where they occur in a URL, but only as "words" in the URL (i.e. not scottgu2), try this, once again with the / caveat:

'/(jeremy\.miller|scottgu)/'

As yet another alternative, if you want the field after the domain name (skipping that field when it is "blogs"), it's going to get hairy, especially with the / caveat:

'http://[^/]+/(?:blogs/)?([^/]+)/'

This will match the domain name, an optional blogs field, and then the desired field. The (?:) syntax is a non-capturing group, which means it works just like regular parentheses but won't capture the value, so the only value captured is the one you want. Support for (?:) can vary depending on your particular regex flavor. I don't know what language you're asking about, but I predominantly use Perl, so this regex should pretty much work as-is if you're using PCRE. If you're using something different, look into how it handles non-capturing groups.
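Here is a small Python sketch of that pattern, assuming Python's `re` module, which supports `(?:...)`. `first_field` is a hypothetical name:

```python
import re

# Host, then an optional non-capturing "blogs/" field, then the captured field.
after_host = re.compile(r"http://[^/]+/(?:blogs/)?([^/]+)/")


def first_field(url):
    """Return the first path field after the host, skipping a leading 'blogs/'."""
    m = after_host.search(url)
    # Group 1 is the only capture group because (?:blogs/) does not capture.
    return m.group(1) if m else None


for url in [
    "http://codebetter.com/blogs/jeremy.miller/",
    "http://weblogs.asp.net/scottgu/",
    "http://weblogs.asp.net/scottgu/archive/2009/08/25/clean-web-config-files-vs-2010-and-net-4-0-series.aspx",
]:
    print(first_field(url))
```

One hairy edge: for a URL like `http://codebetter.com/blogs/` with nothing after it, the optional group backtracks to empty and the capture is `blogs` itself.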

Wow. That's a lot of talking about regexes. I need to shut up and post already.

Chris Lutz
Better than mine, faster answer, good explanation. +1 (and thanks for the code block comment)
AlberT
Works so far. Editing my question with an additional case I didn't think of.
John Sheehan
Thanks. You got me down the right path. I created the pattern on the fly, injecting the host name (which I already had extracted) and then optionally matching /blogs/. Final result: `{0}/(blogs/)*([^/]+)/` with {0} being replaced by the host. Thanks for all your effort, you saved me a lot of time and supported my laziness, which I always appreciate :)
John Sheehan
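For readers who want to try the host-injected pattern from the comment above, a sketch in Python might look like the following. The comment's `{0}` placeholder suggests a format string, but the language isn't stated, so Python, the helper name `blog_name_for_host`, and the `re.escape` call are all assumptions:

```python
import re


def blog_name_for_host(url, host):
    """Build the pattern for a known host and return the field after it."""
    # re.escape keeps the dots in the host literal; the comment doesn't say
    # whether the original code escaped the host, so this is an assumption.
    pattern = "{0}/(blogs/)*([^/]+)/".format(re.escape(host))
    m = re.search(pattern, url)
    # (blogs/) is a capturing group here, so the wanted value is group 2.
    return m.group(2) if m else None


print(blog_name_for_host("http://codebetter.com/blogs/jeremy.miller/", "codebetter.com"))
print(blog_name_for_host("http://weblogs.asp.net/scottgu/", "weblogs.asp.net"))
```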
"The three chief virtues of a programmer are: Laziness, Impatience and Hubris." -- Larry Wall.
Chris Lutz