So I want to match just the domain from either of these:

http://www.google.com/test/
http://google.com/test/
http://google.net/test/

The output for all 3 should be: google

I got this code working for just .com

echo "http://www.google.com/test/" | sed -n "s/.*www\.\(.*\)\.com.*$/\1/p"
Output: 'google'

Then I thought it would be as simple as using, say, (com|net), but that doesn't seem to be true:

echo "http://www.google.com/test/" | sed -n "s/.*www\.\(.*\)\.(com|net).*$/\1/p"
Output: '' (nothing)

I was going to use a similar method to get rid of the "www", but it seems I'm doing something wrong… (does alternation not work in a regex outside the \( \)?)

A: 
s|http://(www\.)?([^.]*)|$2|

It's Perl with alternate delimiters (because it makes it more legible), I'm sure you can port it to sed or whatever you need.
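
As one possible port, here is a minimal sketch of the same substitution using Python's `re` module. The function name `strip_to_name` is made up for illustration, and the trailing `.*` is an addition that is not in the Perl pattern: without it, the substitution leaves the rest of the URL (".com/test/") in place after the name.

```python
import re

# Same pattern as the Perl substitution, plus a trailing ".*" (an addition,
# not in the original) so the remainder of the URL is consumed too.
PATTERN = re.compile(r"http://(www\.)?([^.]*).*")


def strip_to_name(url):
    """Replace the whole URL with capture group 2, the label after any 'www.'."""
    return PATTERN.sub(r"\2", url)


for url in ("http://www.google.com/test/",
            "http://google.com/test/",
            "http://google.net/test/"):
    print(strip_to_name(url))
```

Inputs that do not start with "http://" are returned unchanged, because the pattern never matches them.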

Anon.
A: 

Have you tried using the "-r" switch on your sed command? This enables the extended regular expression mode (egrep-compatible regexes).

Edit: try this, it seems to work. The "?:" characters in front of com|net are there to prevent this set of characters from being captured by the surrounding parentheses.

 echo "http://www.google.com/test/" | sed -nr "s/.*www\.(.*)\.(?:com|net).*$/\1/p"
Guillaume Gervais
Yep: user:~# echo "http\://www.google.com/test/" | sed -n -r "s/.*www\.\(.*\)\.(com|net).*$/\1/p"; returns nothing as does "-E" (take out the "\" from the url)
Mint
See my edited reply: since you are in extended regex mode, you don't need to escape your parentheses to capture characters.
Guillaume Gervais
Thanks! *buys you a beer* (or whatever) :P I always get confused about when and where to use escapes.
Mint
@Mint, does this answer really solve your problem?
This doesn't work for the cases without "www".
Dennis Williamson
Oh wow, I didn't notice that, though I would have when I started using it :P (but it was helpful, as I could edit his code to solve another of my problems :))
Mint
@Guillaume: The `:?` doesn't seem to work for me: `echo "aaabbbccc"|sed -nr 's/(a*)(:?b*)(c*)/\1 \2/p'` produces "aaa bbb"
Dennis Williamson
(`?:` - note the order - works in Perl, though)
Dennis Williamson
@dennis-williamson: you're right, my bad!
Guillaume Gervais
`?:` gives me an error in `sed`. (by the way, you changed your description but not the command)
Dennis Williamson
A: 
#! /bin/bash

urls=(                        \
  http://www.google.com/test/ \
  http://google.com/test/     \
  http://google.net/test/     \
)

for url in "${urls[@]}"; do
  echo "$url" | sed -re 's,^http://(.*\.)*(.+)\.[a-z]+/.+$,\2,'
done
Greg Bacon
This will not give correct results for URLs like www.google.com.cn
ghostdog74
@Ghost requirement.
Greg Bacon
+1  A: 

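If you have Python, you can use the urllib.parse module (named urlparse in Python 2):

from urllib.parse import urlparse

with open("file") as f:
    for line in f:
        # strip the trailing newline before parsing
        o = urlparse(line.strip())
        d = o.netloc.split(".")
        # skip a leading "www" label if there is one
        if d[0] == "www":
            print(d[1])
        else:
            print(d[0])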

output

$ cat file
http://www.google.com/test/
http://google.com/test/
http://google.net/test/

$ ./python.py
google
google
google
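
One caveat with the script above: the URL parser only fills in the host when the input contains "//", so bare inputs such as www.google.com.cn/test come back with an empty host. A minimal sketch of one way around that, assuming every input is either an http URL or a bare host/path string (the helper name `domain_label` is made up for illustration):

```python
from urllib.parse import urlsplit


def domain_label(url):
    """Return the first host label, skipping a leading 'www'."""
    # Assumption: anything without "//" is a bare host/path, so prefix it
    # to make urlsplit treat the first part as the network location.
    if "//" not in url:
        url = "//" + url
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    if labels[0] == "www" and len(labels) > 1:
        return labels[1]
    return labels[0]


for url in ("http://www.google.com/test/",
            "http://google.net/test/",
            "www.google.com.cn/test",
            "google.com/test"):
    print(domain_label(url))
```

Using `hostname` rather than `netloc` also drops a port number if one is present.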

or you can use awk

awk -F"/" '{
    gsub(/http:\/\/|\/.*$/,"")
    split($0,d,".")
    if(d[1]~/www/){
        print d[2]
    }else{
        print d[1]
    }
} ' file

$ cat file
http://www.google.com/test/
http://google.com/test/
http://google.net/test/
www.google.com.cn/test
google.com/test

$ ./shell.sh
google
google
google
google
google
ghostdog74
+1  A: 

This will output "google" in all cases:

sed -n "s|http://\(.*\.\)*\(.*\)\..*|\2|p"

Edit:

This version will handle URLs like "http://google.com.cn/test" and "http://www.google.co.uk/" as well as the ones in the original question:

sed -nr "s|http://(www\.)?([^.]*)\.(.*\.?)*|\2|p"

This version will handle cases that don't include "http://" (plus the others):

sed -nr "s|(http://)?(www\.)?([^.]*)\.(.*\.?)*|\3|p"
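
For comparison, a rough Python sketch of the same pattern. Python's `re` supports `(?:...)` non-capturing groups, so only the wanted label is captured. The function name `first_label` is made up for illustration, and using `match` instead of a substitution means the trailing part of the sed pattern is not needed.

```python
import re

# Optional "http://", optional "www.", then capture everything up to the
# next dot. The (?:...) groups match without capturing.
HOST_LABEL = re.compile(r"(?:http://)?(?:www\.)?([^.]*)\.")


def first_label(url):
    """Return the captured label, or None when the pattern does not match."""
    m = HOST_LABEL.match(url)
    return m.group(1) if m else None


for url in ("http://www.google.com/test/",
            "http://google.com.cn/test",
            "http://www.google.co.uk/",
            "www.google.com.cn/test",
            "google.com/test"):
    print(first_label(url))
```

Like the sed version, this assumes the first dot after the optional prefix belongs to the host name, so a dotless host followed by a path containing a dot would give a wrong result.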
Dennis Williamson
This fails on www.google.com.cn, for example, unless the OP really doesn't have that kind of URL to parse.
ghostdog74
+1 for the second version.
ghostdog74
Ah yes, this one works even better! Thanks Dennis, you seem to help me with a lot of my questions :) (I shouldn't need www.google.com.cn to work, but you never know)
Mint