Suppose I want to turn this:

http://en.wikipedia.org/wiki/Anarchy

into this:

en.wikipedia.org

or, even better, this:

wikipedia.org

Is this even possible with a regex?

A: 

Sure, all you would have to do is search on http://(.*)/wiki/Anarchy

In Perl (sorry, I don't know Ruby, but I expect it's similar):

$string_to_search =~ /http:\/\/(.*?)\// should give you en.wikipedia.org in $1. To get rid of the en, you can simply search on /http:\/\/en\.(.*?)\// instead, which leaves wikipedia.org.

That should do it.

Update: In case you're not familiar with regex, I would recommend picking up a regex book. Mastering Regular Expressions really rocks; I saw it on half.com the other day for 14.99 used. To clarify what I suggested above: look for the string http://en., then for anything until you find a /. That part is captured in $1 (in Perl; I'm not sure if it's the same in Ruby), so a simple print $1 will print the string.

Update #2: Note the \. after en in the second pattern; without it you end up with .wikipedia.org
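For the Ruby side, which this answer leaves open, here is a minimal sketch of the same capture. It assumes the URL starts with http:// and has a path after the host; the method names `host_from` and `host_without_en` are made up for illustration.

```ruby
# Hypothetical helper: returns whatever sits between "http://" and the next "/".
# The lazy (.*?) stops at the first slash instead of running to the last one.
def host_from(url)
  url[/http:\/\/(.*?)\//, 1]
end

# Same idea, but a literal "en." is consumed before the capture starts.
def host_without_en(url)
  url[/http:\/\/en\.(.*?)\//, 1]
end

url = "http://en.wikipedia.org/wiki/Anarchy"
puts host_from(url)        # en.wikipedia.org
puts host_without_en(url)  # wikipedia.org

# Ruby also sets the Perl-style $1 after a successful =~ match.
puts $1 if url =~ /http:\/\/en\.(.*?)\//
```

`String#[]` with a capture index returns `nil` when the pattern does not match, so callers should check for that.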

onaclov2000
+7  A: 

Why use a regex when Ruby has a library for it? The URI library:

ruby-1.9.1-p378 > require 'uri'
 => true 
ruby-1.9.1-p378 > uri = URI.parse("http://en.wikipedia.org/wiki/Anarchy")
 => #<URI::HTTP:0x000001010a2270 URL:http://en.wikipedia.org/wiki/Anarchy> 
ruby-1.9.1-p378 > uri.host
 => "en.wikipedia.org" 
ruby-1.9.1-p378 > uri.host.split('.')
 => ["en", "wikipedia", "org"] 

Splitting the host is one way to separate the domains, but I'm not aware of a reliable way to get the base domain -- you can't just count, in the event of a URL like "http://somedomain.otherdomain.school.ac.uk" vs "www.google.com".
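To make the counting problem concrete, here is a small sketch that naively keeps the last two labels of the host. The helper name `last_labels` is hypothetical; only `URI.parse` and `#host` come from the standard library.

```ruby
require 'uri'

# Hypothetical helper: keep the last `count` dot-separated labels of the host.
def last_labels(url, count = 2)
  URI.parse(url).host.split('.').last(count).join('.')
end

puts last_labels("http://en.wikipedia.org/wiki/Anarchy")         # wikipedia.org
puts last_labels("http://somedomain.otherdomain.school.ac.uk/")  # ac.uk
```

The second result is a public suffix rather than a registrable domain, which is why a fixed label count is not enough.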

Mark Rushakoff
+1 regex is not always the answer.
Jeff
+2  A: 

/http:\/\/([^\/]*).*/ will produce en.wikipedia.org from the string you provided.

/http:\/\/.{0,3}\.([^\/]*).*/ will produce wikipedia.org.
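As a rough sketch of how these two patterns could be applied in Ruby, the following pulls out the first capture group with `String#[]`. The method names are invented for this example.

```ruby
# Captures everything between "http://" and the first "/".
def full_host(url)
  url[/http:\/\/([^\/]*).*/, 1]
end

# Skips a prefix of at most three characters plus a dot, then captures the rest
# of the host. Hosts without such a short prefix do not match at all.
def host_minus_short_prefix(url)
  url[/http:\/\/.{0,3}\.([^\/]*).*/, 1]
end

url = "http://en.wikipedia.org/wiki/Anarchy"
puts full_host(url)                # en.wikipedia.org
puts host_minus_short_prefix(url)  # wikipedia.org
p host_minus_short_prefix("http://wikipedia.org/")  # nil
```

The `.{0,3}` prefix covers `en` and `www`, but a longer subdomain or a bare domain falls through and returns `nil`.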

JGB146
/^http:\/\/\w+\.(\w+\.\w+)/ would limit it more to the type of characters you'd want.
sventech
True. I might leave off the ^ though. If we're making it as general as possible, it might be better to match on any occurrence of a URL.
JGB146
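A short sketch of the unanchored variant discussed in these comments, using `String#scan` to collect every occurrence in a longer piece of text. The method name and the sample text are made up for illustration.

```ruby
# Without the leading ^, the pattern can match URLs anywhere in the text.
# \w does not cover "-", so hyphenated host names are not handled here.
def domains_in(text)
  text.scan(/http:\/\/\w+\.(\w+\.\w+)/).flatten
end

text = "See http://en.wikipedia.org/wiki/Anarchy and http://www.google.com/search"
p domains_in(text)  # ["wikipedia.org", "google.com"]
```

With the `^` anchor in place, the same text would yield no matches because the line does not start with a URL.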
+1  A: 

Yes.

Now I know you haven't asked how, and you haven't specified a language, but I'll answer anyway... (note: this works for any two-letter language subsite, not just en.wikipedia...)

perl:

$url =~ s,http://[a-z]{2}\.(wikipedia\.org)/.*,$1,;

ruby:

url = url.sub(/http:\/\/[a-z]{2}\.(wikipedia\.org)\/.*/, '\1')

php:

$url = preg_replace('|http://[a-z]{2}\.(wikipedia\.org)/.*|', '$1', $url);

Of course, for this particular example, you don't even need a regex, just this will do:

url = 'wikipedia.org'

but I jest...

You probably want to handle any URL and pull out the domain part, and it should also work for domains in different countries, e.g. foo.co.uk.

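In which case, I'd use Mark Rushakoff's solution to get the hostname and then a regex to pull out the domain. Note the lazy .*? at the start: with a greedy .* the optional country-code group never gets a chance to match, and www.foo.co.uk comes out as co.uk.

domain = host.sub(/^.*?\.([^.]+\.[^.]+(\.[a-z]{2})?)$/, '\1')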
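Here is a hedged sketch that combines the two steps, `URI.parse` for the host and a regex for the domain. It uses a lazy `.*?` prefix, since a greedy `.*` would return `co.uk` for `www.foo.co.uk`. The names `DOMAIN_RE` and `base_domain` are assumptions made for this example, and the pattern remains a heuristic.

```ruby
require 'uri'

# Two labels, optionally followed by a two-letter country code, taken after the
# shortest possible subdomain prefix.
DOMAIN_RE = /^.*?\.([^.]+\.[^.]+(\.[a-z]{2})?)$/

# Hypothetical helper: host via the URI library, then the regex heuristic.
def base_domain(url)
  URI.parse(url).host.sub(DOMAIN_RE, '\1')
end

puts base_domain("http://en.wikipedia.org/wiki/Anarchy")  # wikipedia.org
puts base_domain("http://www.foo.co.uk/")                 # foo.co.uk
puts base_domain("http://wikipedia.org/")                 # wikipedia.org (no match, unchanged)

# Known limits of the heuristic:
puts base_domain("http://foo.co.uk/")       # co.uk
puts base_domain("http://a.b.example.de/")  # b.example.de
```

A list-based approach such as the Public Suffix List is the usual way around those last two cases, because no pattern on label shape alone can tell `co.uk` from `example.de`.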

Hope this helps

Also, if you want to learn more, I have a regex tute online: http://tech.bluesmoon.info/2006/04/beginning-regular-expressions.html

bluesmoon
Interesting, I didn't know you could use , instead of / as the regex delimiter in Perl. Also, what does the {2} mean?
onaclov2000
You can use any character that isn't a-z, A-Z, 0-9 or whitespace. You could even use nested braces like this: `s{http://[a-z]{2}\.(wikipedia\.org)/.*}{$1};` You're not even required to use the same braces for the match and the replacement: `s{http://[a-z]{2}\.(wikipedia\.org)/.*}<$1>;` The `{2}` means match the preceding item exactly twice.
bluesmoon
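Ruby has a comparable feature in its `%r{...}` literal, which lets a pattern contain `/` without escaping. The sketch below rewrites the Ruby one-liner from the answer that way; `WIKI_RE` and `strip_language_subdomain` are names invented for this example.

```ruby
# %r{...} avoids the \/ escapes; the nested {2} is fine because the braces balance.
# [a-z]{2} matches exactly two lowercase letters, e.g. "en" or "de".
WIKI_RE = %r{http://[a-z]{2}\.(wikipedia\.org)/.*}

def strip_language_subdomain(url)
  url.sub(WIKI_RE, '\1')
end

puts strip_language_subdomain("http://en.wikipedia.org/wiki/Anarchy")   # wikipedia.org
puts strip_language_subdomain("http://de.wikipedia.org/wiki/Anarchie")  # wikipedia.org
puts strip_language_subdomain("http://www.wikipedia.org/")              # unchanged: "www" is three letters
```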