We had an issue with the text editor on our website that was doubling up URLs. For example, a text field may contain:

This is a description for a media item, and here is <a href="http://www.example.com/apage.htmlhttp://www.example.com/apage.html">a link</a>.

So pretty much I need a regex to detect any string that begins with http and has another http before a closing quote, as in "http://www.example.com/apage.htmlhttp://www.example.com/apage.html"

+3  A: 
"http[^"]+http
Uh Clem
A: 

If your regex flavor supports the non-greedy .*? syntax, you can just look for the following:

http(.*?)http

and if it's present, reject the URL.

eykanal
A: 

The string that begins with http and has another http before a quote is:

^http[^"]*http

But, although this answers exactly your question I suspect you may want Uh Clem's answer instead ;-)
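
To make the difference between the two patterns concrete, here is a minimal Python sketch. The names ANCHORED, QUOTED, bare and html are just illustrative, and it assumes attribute values are always double-quoted:

```python
import re

# Anchored pattern from this answer: the whole string must begin with "http".
ANCHORED = re.compile(r'^http[^"]*http')
# Uh Clem's pattern: an opening quote followed by "http", then a second
# "http" before any closing quote.
QUOTED = re.compile(r'"http[^"]+http')

# A bare doubled URL, and the same URL embedded in a description field.
bare = 'http://www.example.com/apage.htmlhttp://www.example.com/apage.html'
html = ('This is a description, and here is <a href="' + bare + '">a link</a>.')

print(bool(ANCHORED.search(bare)))  # the string itself starts with http
print(bool(ANCHORED.search(html)))  # the description starts with other text
print(bool(QUOTED.search(html)))    # finds the quoted attribute value
```

The anchored version only helps if you have already pulled the attribute value out of the surrounding text; the quoted version can be run over the whole field.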

Michael Krelin - hacker
A: 

You will probably want something like this:

("http[^"]+)(http)

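Then compare the two captures, and if the second repeats the first (that is, \1 === " + \2), replace them with a single copy. Note that, as written, the second group captures only the literal http, so it would need to be extended to capture the rest of the URL (for example (http[^"]*)) before that comparison can succeed.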

One thought: do you have any query strings in any of your URLs? If you do, are any of them like this: "http://someurl.com?http=somemoredatahttp://someurl.com?http=somemoredata"?

If so, you will want something far more complicated.
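
As a rough illustration of why the query-string case matters, the following Python sketch shows a "two https inside the quotes" pattern flagging a URL that is not doubled, and one possible stricter check that compares the two halves of the value. The helper name is_exact_double is hypothetical, and the sketch assumes the editor bug always produces an exact back-to-back copy:

```python
import re

# Simple pattern: any quoted value that contains "http" twice.
NAIVE = re.compile(r'"http[^"]+http')

def is_exact_double(value: str) -> bool:
    """Return True when value is one non-empty string written twice in a row."""
    half, remainder = divmod(len(value), 2)
    return remainder == 0 and half > 0 and value[:half] == value[half:]

single = 'http://someurl.com?http=somemoredata'
doubled = single + single

# The naive pattern matches both, even though only one of them is doubled.
print(bool(NAIVE.search('"' + single + '"')))
print(bool(NAIVE.search('"' + doubled + '"')))

# Comparing halves tells the two cases apart.
print(is_exact_double(single))
print(is_exact_double(doubled))
```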

Sean Vieira
+1  A: 
http://www.example.com/apage.htmlhttp://www.example.com/apage.html

This is actually a valid URL! So you'd want to be a bit careful not to munge any other URLs that happen to have ‘http://’ in the middle of them. To detect only a ‘doubled’ URL you could use backreferences:

"(https?://[^"]*)\1"

(This is a non-standard regex feature, but most modern implementations have it.)
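
For instance, a minimal sketch in Python (whose re module supports backreferences) could collapse each doubled URL to a single copy. The function name collapse_doubled_urls is made up for illustration, and it assumes double-quoted attribute values:

```python
import re

# A quoted http(s) URL immediately followed by an exact copy of itself.
DOUBLED_URL = re.compile(r'"(https?://[^"]*)\1"')

def collapse_doubled_urls(html: str) -> str:
    """Replace each quoted, exactly doubled URL with a single copy."""
    return DOUBLED_URL.sub(r'"\1"', html)

broken = ('<a href="http://www.example.com/apage.html'
          'http://www.example.com/apage.html">a link</a>')
# A legitimate URL that merely contains a second "http://" is left alone.
nested = '<a href="http://example.com/redirect?to=http://other.example/">x</a>'

print(collapse_doubled_urls(broken))
print(collapse_doubled_urls(nested))
```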

Using regex to process HTML is a bad idea. HTML cannot reliably be parsed by regex.
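
If you would rather avoid regex on the markup altogether, one possible sketch uses Python's standard html.parser module to pull out the href values first and only then checks each value. The class and function names here are illustrative, and the "two identical halves" test is an assumption about how the editor bug behaves:

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def find_doubled_hrefs(html):
    """Return the href values that consist of one URL repeated twice."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    doubled = []
    for href in parser.hrefs:
        half = len(href) // 2
        if len(href) % 2 == 0 and href[:half] == href[half:]:
            doubled.append(href)
    return doubled

page = ('<p>Here is <a href="http://www.example.com/apage.html'
        'http://www.example.com/apage.html">a link</a> and '
        '<a href="http://www.example.com/other.html">another</a>.</p>')
print(find_doubled_hrefs(page))
```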

bobince