ansaurus

Question

Answer 1

+5 A:

Try this one:

/\/([\w\.]+)\/archive/

RaYell 2009-08-27 06:33:45

Damn, beat me by just a few seconds. +1

Chris Lutz 2009-08-27 06:36:09

When tested here http://www.regexlib.com/RETester.aspx this one didn't work.

John Sheehan 2009-08-27 06:43:35

It does work. You just need to remove the first and last `/` if you are using that tool. I'm using PERL notation here to mark the beginning and end of the regular expression.
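For anyone testing outside Perl, here is a minimal Python sketch of the same point. The sample URL is made up (the question's real URLs aren't shown here), and slicing off the delimiters by hand is just for illustration:

```python
import re

# The answer as posted, with Perl-style "/" delimiters at both ends.
perl_style = r"/\/([\w\.]+)\/archive/"

# Drop the first and last "/"; what remains is the regex itself.
body = perl_style[1:-1]
pattern = re.compile(body)

# Hypothetical URL, used only to exercise the pattern.
url = "http://example.com/blogs/jeremy.miller/archive/2009/08/26/post.aspx"
match = pattern.search(url)
print(match.group(1))
```

The remaining `\/` are just escaped slashes, which Python's `re` accepts even though it doesn't need them.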

RaYell 2009-08-27 06:56:00

RaYell - What's PERL? I know Perl is a language, and `perl` is the interpreter for that language, but I'm not familiar with PERL.

Chris Lutz 2009-08-27 06:57:21

Answer 2

+7 A:

Would this be what you're looking for?

'/([^/]+)/archive/'

This captures the piece before "archive" in both cases. Depending on your regex flavor, you may need to escape the /s for it to work. If you don't want to match the archive part, you could use a lookahead instead, but I don't like lookaheads; in my opinion it's easier to match a lot and capture only the parts you need. If you prefer a lookahead to verify that the next part is archive, you can write one yourself.
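A rough Python sketch of both approaches, in case it helps. The URLs are invented stand-ins for the two cases, and the lookahead version is only one possible way to write it:

```python
import re

# Hypothetical sample URLs; the real ones from the question are not shown here.
urls = [
    "http://example.com/blogs/jeremy.miller/archive/2009/08/26/post.aspx",
    "http://example.org/scottgu/archive/2009/08/25/post.aspx",
]

# Match a lot, capture only the part you need.
capture = re.compile(r"/([^/]+)/archive/")

# Lookahead variant: match only the segment, then check that
# "/archive/" follows without consuming it.
lookahead = re.compile(r"[^/]+(?=/archive/)")

captured = [capture.search(url).group(1) for url in urls]
looked = [lookahead.search(url).group(0) for url in urls]
print(captured)
print(looked)
```

Both lists should hold the same two segments; the difference is only in how much of the URL the overall match covers.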

EDIT: As you update your question, my idea of what you want is becoming fuzzier. If you want a new regex to match the second case, you can just pluck the appropriate part off the end, with the same / conditions as before:

'/([^/]+)/$'
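A quick Python sketch of that end-anchored pattern, assuming URLs that stop right after the wanted segment (the URLs below are made up):

```python
import re

# Hypothetical URLs that end right after the wanted segment.
urls = [
    "http://example.com/blogs/jeremy.miller/",
    "http://example.org/scottgu/",
]

# "$" anchors the match to the end of the string, so only the last
# slash-terminated segment can be captured.
last_segment = re.compile(r"/([^/]+)/$")

names = [last_segment.search(url).group(1) for url in urls]
print(names)

# Without the trailing slash there is nothing for the final "/" to match.
no_slash = last_segment.search("http://example.org/scottgu")
print(no_slash)
```

Note that this version depends on the trailing slash being present.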

If you specifically want either the text jeremy.miller or scottgu, regardless of where they occur in a URL, but only as "words" in the URL (i.e. not scottgu2), try this, once again with the / caveat:

'/(jeremy\.miller|scottgu)/'
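To illustrate the "words" behavior, a small Python sketch with invented URLs:

```python
import re

# The surrounding slashes force a whole-segment match.
names = re.compile(r"/(jeremy\.miller|scottgu)/")

# Hypothetical URLs: one exact segment, one with a trailing digit,
# and one where the dot is replaced by another character.
hit = names.search("http://example.org/scottgu/archive/2009/08/25/post.aspx")
miss = names.search("http://example.org/scottgu2/archive/2009/08/25/post.aspx")
not_dot = names.search("http://example.com/blogs/jeremyXmiller/archive/")

print(hit.group(1))
print(miss)
print(not_dot)
```

The escaped `\.` matters: an unescaped dot would also accept `jeremyXmiller`.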

As yet a third alternative, if you want the field after the domain name, unless that field is "blogs", it's going to get hairy, especially with the / caveat:

'http://[^/]+/(?:blogs/)?([^/]+)/'

This will match the domain name, an optional blogs field, and then the desired field. The (?:) syntax is a non-capturing group, which means it's just like regular parentheses, but won't capture the value, so the only value captured is the value you want. Support for (?:) can vary depending on your particular regex flavor. I don't know what language you're asking for, but I predominantly use Perl, so this regex should pretty much do it if you're using PCRE. If you're using something different, look into non-capturing groups.
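Python's `re` module supports the same (?:) syntax, so here is a minimal sketch of it there. The URLs are hypothetical stand-ins for the two cases:

```python
import re

# (?:blogs/)? is optional and non-capturing, so the wanted field is
# always the first (and only) group.
field = re.compile(r"http://[^/]+/(?:blogs/)?([^/]+)/")

# Hypothetical URLs: one with a "blogs" field, one without.
urls = [
    "http://example.com/blogs/jeremy.miller/archive/2009/08/26/post.aspx",
    "http://example.org/scottgu/archive/2009/08/25/post.aspx",
]

groups = [field.match(url).groups() for url in urls]
print(groups)
```

Each result is a one-element tuple, which shows that the blogs part never takes up a group slot.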

Wow. That's a lot of talking about regexes. I need to shut up and post already.

Chris Lutz 2009-08-27 06:34:52

Better than mine: faster answer, good explanation. +1 (and thanks for the code block comment)

AlberT 2009-08-27 06:43:24

Works so far. Editing my question with an additional case I didn't think of.

John Sheehan 2009-08-27 06:44:12

Thanks. You got me down the right path. I created the pattern on the fly, injecting the host name (which I already had extracted) and then optionally matching /blogs/. Final result: `{0}/(blogs/)*([^/]+)/` with {0} being replaced by the host. Thanks for all your effort, you saved me a lot of time and supported my laziness, which I always appreciate :)
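Roughly what that looks like as a Python sketch. The helper name `build_pattern` and the host values are made up for illustration, and escaping the host with `re.escape` is an extra assumption that the comment above doesn't mention:

```python
import re

def build_pattern(host):
    """Hypothetical helper: inject an already-extracted host into the pattern."""
    # re.escape keeps the dots in the host literal (an added precaution).
    return re.compile("{0}/(blogs/)*([^/]+)/".format(re.escape(host)))

# (blogs/) is a capturing group here, so the wanted value is group 2.
with_blogs = build_pattern("example.com").search(
    "http://example.com/blogs/jeremy.miller/archive/2009/08/26/post.aspx"
)
without_blogs = build_pattern("example.org").search(
    "http://example.org/scottgu/archive/2009/08/25/post.aspx"
)

print(with_blogs.group(2))
print(without_blogs.group(2))
```

Swapping `(blogs/)*` for `(?:blogs/)?` would move the value back to group 1.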

John Sheehan 2009-08-27 07:03:08

"The three chief virtues of a programmer are: Laziness, Impatience and Hubris." -- Larry Wall.

Chris Lutz 2009-08-27 07:10:49

Regex to extract part of a url