ansaurus

Question

Regex to find bad URLs in a database field

Answer 1

+3 A:

"http[^"]+http

Uh Clem 2009-09-25 16:57:44

Answer 2

A:

If you can use the lazy .*? syntax, you can just look for the following:

http(.*?)http

and if it's present, reject the URL.
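As a rough sketch of that check in Python (the function name `has_second_http` is made up for illustration, and the sample URL is borrowed from elsewhere in this thread), it could look like this:

```python
import re

# Pattern from the answer: "http", lazily anything, then another "http".
SECOND_HTTP = re.compile(r"http(.*?)http")

def has_second_http(url: str) -> bool:
    """Return True if the text "http" occurs at least twice in the string."""
    return SECOND_HTTP.search(url) is not None

good = "http://www.example.com/apage.html"
bad = good + good  # the doubled form described in the question

print(has_second_http(good))  # False
print(has_second_http(bad))   # True
```

Note that this rejects any string with a second "http" anywhere in it, not only exact doubles.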

eykanal 2009-09-25 16:57:47

Answer 3

A:

A string that begins with http and has another http before any quote is matched by:

^http[^"]*http

Although this answers your question exactly, I suspect you may want Uh Clem's answer instead ;-)
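To illustrate the difference between the anchored pattern and the quoted one, here is a small Python sketch. It assumes two hypothetical kinds of field content: one where the field holds only the URL, and one where the URL sits inside an HTML attribute. The variable and function names are made up for the example.

```python
import re

ANCHORED = re.compile(r'^http[^"]*http')  # field must start with the URL
QUOTED = re.compile(r'"http[^"]+http')    # URL must follow a double quote

# Hypothetical field contents, using the doubled URL quoted in this thread.
bare_field = "http://www.example.com/apage.htmlhttp://www.example.com/apage.html"
html_field = '<a href="' + bare_field + '">a page</a>'

def matches(pattern: re.Pattern, text: str) -> bool:
    """Return True if the pattern is found anywhere in the text."""
    return pattern.search(text) is not None

print(matches(ANCHORED, bare_field))  # True
print(matches(ANCHORED, html_field))  # False: the field starts with "<a"
print(matches(QUOTED, bare_field))    # False: no quote before the URL
print(matches(QUOTED, html_field))    # True
```

Which of the two fits depends on whether the database field stores a bare URL or markup containing quoted URLs.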

Michael Krelin - hacker 2009-09-25 16:59:04

Answer 4

A:

You will probably want something like this:

("http[^"]+)(http)

Then compare the two captured groups and, if \1 === '"' + \2, replace them.

One thought: do you have any query strings in any of your URLs? If you do, are any of them like this: "http://someurl.com?http=somemoredatahttp://someurl.com?http=somemoredata"?

If so, you will want something far more complicated.
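A quick Python sketch of why the query-string case matters: a pattern that only looks for a second "http" inside the quotes flags a single, undoubled URL whose query string happens to contain "http". The helper name `flagged` is made up for the example.

```python
import re

# Simple "second http inside the quotes" pattern suggested earlier in the thread.
NAIVE = re.compile(r'"http[^"]+http')

single = '"http://someurl.com?http=somemoredata"'
doubled = '"http://someurl.com?http=somemoredatahttp://someurl.com?http=somemoredata"'

def flagged(text: str) -> bool:
    """Return True if the naive pattern considers the quoted URL bad."""
    return NAIVE.search(text) is not None

print(flagged(doubled))  # True, as intended
print(flagged(single))   # True as well: a false positive
```

So with URLs like these, the simple patterns cannot tell a doubled URL from a legitimate one.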

Sean Vieira 2009-09-25 17:05:38

Answer 5

+1 A:

http://www.example.com/apage.htmlhttp://www.example.com/apage.html

This is actually a valid URL! So you'd want to be a bit careful not to munge any other URLs that happen to have ‘http://’ in the middle of them. To detect only a ‘doubled’ URL you could use backreferences:

"(https?://[^"]*)\1"

(Backreferences are not part of regular expressions in the strict formal sense, but most modern implementations support them.)
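For example, here is a minimal Python sketch of the backreference approach. It assumes the field text holds URLs in double quotes; the function name `collapse_doubled_urls` and the sample markup are made up for illustration.

```python
import re

# A quoted URL immediately followed by an exact copy of itself.
DOUBLED_URL = re.compile(r'"(https?://[^"]*)\1"')

def collapse_doubled_urls(text: str) -> str:
    """Replace each quoted, exactly doubled URL with a single copy."""
    return DOUBLED_URL.sub(r'"\1"', text)

doubled = '<a href="http://www.example.com/apage.htmlhttp://www.example.com/apage.html">x</a>'
nested = '<a href="http://someurl.com?http=somemoredata">y</a>'

print(collapse_doubled_urls(doubled))
# <a href="http://www.example.com/apage.html">x</a>
print(collapse_doubled_urls(nested) == nested)  # True: left untouched
```

Because the second half has to repeat the first half exactly, a URL that merely contains "http" somewhere in the middle is left alone.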

Using regex to process HTML is a bad idea. HTML cannot reliably be parsed by regex.

bobince 2009-09-25 17:47:26
