I'm trying to use sed to clean up lines of URLs and extract just the domain.

e.g., from:

http://www.suepearson.co.uk/product/174/71/3816/

I want:

http://www.suepearson.co.uk/

(either with or without the trailing slash, it doesn't matter)

I have tried:

sed 's|\(http:\/\/.*?\/\).*|\1|'

and (escaping the non-greedy quantifier)

sed 's|\(http:\/\/.*\?\/\).*|\1|'

but I cannot get the non-greedy quantifier to work, so it always ends up matching the whole string.

+5  A: 

Try [^/]* instead of .*?:

sed 's|\(http://[^/]*/\).*|\1|g'
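
To see why the negated class helps, here is a minimal sketch run against the URL from the question. The variable names (url, greedy, domain) are only illustrative. The first command uses a plain greedy .* and the second uses [^/]*, which cannot cross a slash:

```shell
url='http://www.suepearson.co.uk/product/174/71/3816/'

# Greedy: .* runs on to the last slash in the line, so the whole URL is kept.
greedy=$(printf '%s\n' "$url" | sed 's|\(http://.*/\).*|\1|')

# Negated class: [^/]* stops at the first slash after the host name.
domain=$(printf '%s\n' "$url" | sed 's|\(http://[^/]*/\).*|\1|')

printf '%s\n%s\n' "$greedy" "$domain"
```

The first line printed should be the unchanged URL and the second just http://www.suepearson.co.uk/.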
Gumbo
sed 's|\(http:\/\/[^\/]+\)|\1|' still spews out the whole thing.
Joel
@Joel: edited version should work.
chaos
+17  A: 

Neither basic nor extended POSIX/GNU regex syntax recognizes the non-greedy quantifier; you need Perl-compatible regexes. Fortunately, that's pretty easy to get:

perl -pe 's|(http://.*?/).*|$1|'
chaos
Works perfectly.
Joel
+1  A: 
sed -E 's|(http://[^/]+/).*|\1|'
Lucero
A: 

Another way, without a regex, is to split the line on the "/" delimiter and print only the first three fields, e.g.:

string="http://www.suepearson.co.uk/product/174/71/3816/"
echo "$string" | awk -F"/" '{print $1,$2,$3}' OFS="/"
ghostdog74
A: 

sed does not support a "non-greedy" operator.

You have to use a "[^...]" bracket expression to exclude "/" from the match.

sed 's,\(http://[^/]*\)/.*,\1,'

P.S. There is no need to backslash "/" when the delimiter is a different character.
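
If the input may also contain https URLs (an assumption, the question only shows http), the same idea can be sketched as a small filter. strip_path is a hypothetical helper name and example.com is a placeholder host. Note that the delimiter is | here, because the interval \{0,1\} contains a comma:

```shell
# strip_path: hypothetical helper; reads URLs on stdin, one per line,
# and prints scheme plus host without the trailing slash.
strip_path() {
    # s\{0,1\} is the POSIX basic-regex way to write "an optional s".
    sed 's|\(https\{0,1\}://[^/]*\).*|\1|'
}

# example.com is a placeholder used only for illustration.
printf '%s\n' \
    'http://www.suepearson.co.uk/product/174/71/3816/' \
    'https://example.com/a/b?c=d' | strip_path
```

Because the trailing part is matched with .* rather than /.*, a URL with no path at all passes through unchanged.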

andcoz
A: 

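sed -E makes sed interpret regular expressions as extended (modern) regular expressions, so grouping parentheses and + work without backslashes. Extended syntax still has no non-greedy quantifier, so a [^/] bracket expression is needed either way.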

stepancheg