I've loaded an HTML document into a string with .NET. I have a regex that matches URLs so I can replace them, but I need it to match ONLY URLs that are NOT fully qualified.

If this is my string:

djdjdjdjdjdj src="www.example.com/images/x.gif" dkkdkdkdk src="/images/x.gif"

My result would look like this:

djdjdjdjdjdj src="subdomain.example.com/images/x.gif" dkkdkdkdk src="http://www.example.com/images/x.gif"

My thinking is that I need a regex that matches strings that start with src or href and do not contain more than one period. The regex below matches any link that has at least one period, so it also matches the fully qualified URLs I want to skip.

(src|href)\=(\"(.+?)[\.](.+?)\")

Thanks for any info. I'm coding this in C#, but I only need the regex.

+1  A: 

Warning: HTML + regex = round peg + square hole

That being said, here's the hammer you requested:

(src|href)\=(\"[^."]*\.?[^."]\")
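As written, the final character class in that pattern has no quantifier, so it only allows a single character after the optional period. A minimal sketch of a variant with a `*` added there, run against the sample string from the question, might look like this. It assumes C# 9 or later for top-level statements, and the variable names are only illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample input from the question (closing quote added to the second attribute).
string html = "djdjdjdjdjdj src=\"www.example.com/images/x.gif\" dkkdkdkdk src=\"/images/x.gif\"";

// Assumed variant of the pattern above: the last character class also gets a `*`,
// so the optional period may be followed by any number of non-period characters.
var atMostOnePeriod = new Regex(@"(src|href)=(""[^.""]*\.?[^.""]*"")");

MatchCollection matches = atMostOnePeriod.Matches(html);
foreach (Match m in matches)
{
    Console.WriteLine(m.Value); // prints: src="/images/x.gif"
}
```

The first attribute contains three periods, so it is skipped. Only the root-relative one is matched.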
Zen
+3  A: 

I would suggest you try something like the HTML Agility Pack, as recommended many times on this site: http://stackoverflow.com/questions/100358/looking-for-c-html-parser

Also, it wouldn't hurt to read this obscure blog entry by some Metallica fan before you start.

Tj Kellie
A: 

Zen,

I tried this and it's not working for me: (src|href)\=(\"[^."]*\.?[^."]\")

I tested it at http://www.regextester.com/ with this test string, and it did not find a match either:

dkdkdkdkdkd src="http://www.google.com/image/x.gif" dkdkdkdkdkdkd dkdkdkdkdkd src="/image/x.gif" dkdkdkdkdkdkd href="me.google.com/image/x.gif"

I'll check the HTML parser, but I need a coded solution in .NET.

Also, in response to the point about URLs with more subdomains (and periods): they have more than one period, so, as expected, they would not match.
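For the replacement step itself, a rough sketch of what I am after is below. It is only an assumption about how the pieces could fit together: the base address is hypothetical, the pattern is the "at most one period" idea with a `*` on the final character class, and it uses `Regex.Replace` with a match evaluator plus `Uri.TryCreate` to resolve each matched value. It assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical base address; substitute the site the page was loaded from.
var baseUri = new Uri("http://www.google.com/");

// The test string from this post.
string html = "dkdkdkdkdkd src=\"http://www.google.com/image/x.gif\" dkdkdkdkdkdkd "
            + "dkdkdkdkdkd src=\"/image/x.gif\" dkdkdkdkdkdkd "
            + "href=\"me.google.com/image/x.gif\"";

// Group 1: attribute name. Group 2: a value containing at most one period.
var relativeUrl = new Regex(@"(src|href)=""([^.""]*\.?[^.""]*)""");

string result = relativeUrl.Replace(html, m =>
{
    // Resolve the value against the base; leave the attribute untouched if that fails.
    return Uri.TryCreate(baseUri, m.Groups[2].Value, out var resolved)
        ? m.Groups[1].Value + "=\"" + resolved.AbsoluteUri + "\""
        : m.Value;
});

Console.WriteLine(result);
```

With this input, only src="/image/x.gif" is rewritten, to src="http://www.google.com/image/x.gif". The period count is just a heuristic, though: a relative path such as ../img/x.gif contains three periods and would be skipped.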

Thank You!

jc