I want to "fix" all these URLs so that all of them are valid (all the URLs are from the domain "example.com"):

before:

httpprache.htm
tech-z.htm
bla/blubbb.html
/suchen/bildung/schulen/abend.htm
/suchen/bildung/schulen/beruf.htm
www.google.de
http://www.google.com/asdf.html
https://blabla.com/

after:

http://example.com/httpprache.htm
http://example.com/tech-z.htm
http://example.com/bla/blubbb.html
http://example.com//suchen/bildung/schulen/abend.htm
http://example.com//suchen/bildung/schulen/beruf.htm
http://www.google.de
http://www.google.com/asdf.html
https://blabla.com/

How can I do this with one or more regular expressions?

+2  A: 

In order for us to help you with this task, you will have to be more precise about when a URL should be fixed to the domain example.com and when it should only be corrected to point at another domain. As I see it now, you could simply check whether the URL starts with www., http:// or https:// ("^(www\.|https?://)"). If it starts with none of them, you can prefix the string with "http://example.com/".

If, however, the string starts with 'www.' and has no 'http://' or 'https://' in front, you can add 'http://' to the start of the string.

This is, however, all deduced from the few examples you provided; there may be thousands of other cases to look out for, so this might turn into quite an elaborate task.
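
The checks described above can be sketched as a small shell function. The names fix_url and fix_urls are hypothetical, and the sketch assumes the tests are anchored at the start of the string, since a plain "contains http" test would also match httpprache.htm:

```shell
#!/bin/sh
# Sketch only: fix_url and fix_urls are hypothetical helper names,
# and example.com is the domain given in the question.
fix_url() {
    case "$1" in
        http://*|https://*) printf '%s\n' "$1" ;;                    # already absolute: keep as-is
        www.*)              printf 'http://%s\n' "$1" ;;             # external host without a scheme
        *)                  printf 'http://example.com/%s\n' "$1" ;; # everything else: path on example.com
    esac
}

# Apply fix_url to every line read from standard input.
fix_urls() {
    while IFS= read -r url; do
        fix_url "$url"
    done
}
```

It could be used as `fix_urls < file-with-urls`. Paths that already begin with a slash keep it, so they come out with a double slash after the domain, as in the expected output in the question.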

Håkon
+1  A: 

With sed-style regexes (extended syntax, enabled with -E):

cat file-with-urls | sed -E 's/^(www\.[^.]+\.[a-z]+)$/http:\/\/\1/' | sed -E '/^https?:\/\//!s/^/http:\/\/example.com\//'

1st one:

if the string is "www.", followed by one or more non-dot characters, a single dot, then one or more lowercase letters, add http:// on the front

2nd one:

if string doesn't start with http:// or https://, put http://example.com/ on the front
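
As a rough way to try the two rules on a few of the sample URLs, they can be combined into a single sed call. This is a sketch, assuming a sed that supports -E (both GNU and BSD sed do); the wrapper name fix_urls_sed is hypothetical:

```shell
#!/bin/sh
# Sketch only: fix_urls_sed is a hypothetical wrapper name.
# -E selects extended regular expressions, so ( ), + and ? need no backslashes.
fix_urls_sed() {
    sed -E \
        -e 's/^(www\.[^.]+\.[a-z]+)$/http:\/\/\1/' \
        -e '/^https?:\/\//!s/^/http:\/\/example.com\//'
}

# Feed three of the sample URLs from the question through both rules.
printf '%s\n' httpprache.htm www.google.de http://www.google.com/asdf.html | fix_urls_sed
```

The first rule only matches a bare host of the form www.name.tld. A line such as www.google.de/page.html does not match it, so the second rule would prefix it with http://example.com/.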

matja