All,

I need to write a regular expression that performs the following replacements:

(A)

src ="/folder/image.jpg"

or

src="http://www.mydomain.com/folder/image.jpg"

with

src="/cache/getCacheItem.aspx?source_url=http://www.mydomain.com/folder/image.jpg"

(B)

href="/folder/file.zip"

or

href="http://www.mydomain.com/folder/file.zip"

with

href="/cache/getCacheItem.aspx?source_url=http://www.mydomain.com/folder/file.zip"

I know I can use

(src|href).*?=['|\"](?<url>.*?)['|\"]

with a replace value of

$1="/legacy_integration/cache/getCacheItem.aspx?source_url=$2"

to catch the src=... and href=... attributes. However, I need to filter based on file extension - only match valid image extensions like jpg, png, gif, and only match href extensions like zip and pdf.

Any suggestions? The problem can be summarized as: modify the above expression to match only certain file extensions, and insert the domain http://www.mydomain.com/ only if the original URL was relative, so that the output contains the domain exactly once.

Do I need to perform this using two different regular expressions, one for source text including the domain and one without? Or can I somehow use a conditional match statement that, in combination with a replacement expression, will insert the domain or not based on whether the matched text contains the domain?

I know I can perform this using a custom match evaluator, but it seems that it may be faster/more efficient to do it within the regex itself.
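For comparison, here is a minimal sketch of what the match-evaluator version could look like. The class name, the group names, and the extension lists are illustrative assumptions, not existing code; the domain and cache page are taken from the examples above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; all names here are assumptions made for illustration.
public static class CacheUrlRewriter
{
    private const string Domain = "http://www.mydomain.com";
    private const string CachePage = "/cache/getCacheItem.aspx?source_url=";

    // Assumed extension lists, based on the examples in the question.
    private static readonly string[] ImageExtensions = { ".jpg", ".png", ".gif" };
    private static readonly string[] FileExtensions = { ".zip", ".pdf" };

    // attr:   src or href
    // quote:  the opening quote, which must also close the value
    // domain: optional, matched but deliberately left out of "path"
    // path:   root-relative path (a leading "//" is rejected)
    private static readonly Regex AttributePattern = new Regex(
        @"\b(?<attr>src|href)\s*=\s*(?<quote>[""'])(?<domain>http://www\.mydomain\.com)?(?<path>/(?!/)[^""']+)\k<quote>",
        RegexOptions.IgnoreCase);

    public static string Rewrite(string html)
    {
        return AttributePattern.Replace(html, m =>
        {
            string attr = m.Groups["attr"].Value;
            string path = m.Groups["path"].Value;
            string quote = m.Groups["quote"].Value;

            // src values must be images, href values must be downloadable files.
            string[] allowed = string.Equals(attr, "src", StringComparison.OrdinalIgnoreCase)
                ? ImageExtensions
                : FileExtensions;
            bool extensionOk = Array.Exists(
                allowed, ext => path.EndsWith(ext, StringComparison.OrdinalIgnoreCase));

            // Leave anything with a different extension untouched.
            if (!extensionOk)
            {
                return m.Value;
            }

            // The domain is always written exactly once, whether or not it was matched.
            return attr + "=" + quote + CachePage + Domain + path + quote;
        });
    }
}
```

Calling CacheUrlRewriter.Rewrite on either form of the src example in (A) should produce the same cached URL, while an attribute such as href="/folder/readme.txt" is returned unchanged.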

Suggestions/comments?

+1  A: 

Does the following expression work?

Regex.Replace(url,
    @"(src|href)\s*=\s*(?:'|"")((?:http://www\.mydomain\.com)?[^'""]*?\.(jpg|bmp|png))(?:'|"")",
    "$1=\"/cache/getCacheItem.aspx?source_url=$2\"");

The idea is that you match the text http://www.mydomain.com optionally. If it is present, it is captured as part of the $2 match text, so it makes its way into the replaced string.
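If the domain should appear in the output whether or not it was in the original URL, one variation is to match the optional domain outside the captured path and write it literally in the replacement. The sketch below assumes that requirement and runs one pass per attribute so that each gets its own extension list. The class name, the BuildPattern helper, and the extension lists are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; names and extension lists are assumptions for illustration.
public static class RegexOnlyRewriter
{
    // ${attr}, ${q} and ${path} refer to the named groups built in BuildPattern.
    // The domain is written literally, so it appears exactly once in the output.
    private const string Replacement =
        "${attr}=${q}/cache/getCacheItem.aspx?source_url=http://www.mydomain.com${path}${q}";

    // Builds a pattern for one attribute and one '|'-separated extension list.
    // The optional domain sits in a non-capturing group, outside "path".
    private static string BuildPattern(string attribute, string extensions)
    {
        return @"\b(?<attr>" + attribute + @")\s*=\s*(?<q>[""'])"
             + @"(?:http://www\.mydomain\.com)?"
             + @"(?<path>/[^""']+\.(?:" + extensions + @"))\k<q>";
    }

    public static string Rewrite(string html)
    {
        // One pass for image sources, one pass for downloadable links.
        html = Regex.Replace(html, BuildPattern("src", "jpg|png|gif"),
                             Replacement, RegexOptions.IgnoreCase);
        return Regex.Replace(html, BuildPattern("href", "zip|pdf"),
                             Replacement, RegexOptions.IgnoreCase);
    }
}
```

Note that this is not idempotent: running Rewrite a second time on its own output would wrap the already rewritten URLs again, so it should be applied once per document.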

David Andres
A: 

This pattern matches any path. If you want to constrain the path, you can add the constraint after the ?/.

(?<pre>(?:src|href)\W*=\W*(?:"|'))(?<url>(?:http://www\.mydomain\.com)?/(?<file>[^"']+))(?<post>"|')

Here's some sample code:

string pattern = "(?<pre>(?:src|href)\\W*=\\W*(?:\"|'))(?<url>(?:http://www\\.mydomain\\.com)?/(?<file>[^\"']+))(?<post>\"|')";

string test = "src =\"/folder/image.jpg\"\r\n"
            + "src=\"http://www.mydomain.com/folder/image.jpg\"\r\n"
            + "href=\"/folder/file.zip\"\r\n"
            + "href=\"http://www.mydomain.com/folder/file.zip\"";

string replacement = "${pre}/cache/getCacheItem.aspx?source_url=http://www.mydomain.com/${file}${post}";

test = Regex.Replace(test, pattern, replacement);
CptSkippy
A: 

What about this?

var reg = new Regex("(/folder/[^\"]+)");
Match m = reg.Match("src=\"http://www.mydomain.com/folder/image.jpg\"");
var result = string.Format("src=\"/cache/getCacheItem.aspx?source_url=http://www.mydomain.com{0}\"", m.Groups[1].Value);
Esben Skov Pedersen
@Espen P: It looks like this results in URLs that always contain http://www.mydomain.com. From what I gather from the OP, David wants this domain included only if it was present in the original URL.
David Andres
I probably wasn't clear - I want the domain included whether or not it was part of the original URL.
David Lively
+2  A: 

This comes up all the time. Regex is not an appropriate tool to parse a non-regular grammar such as HTML. Use a real parser (like the HTML agility pack) to do this.

annakata
I don't need to parse ALL HTML, just the specified tags. I also have control over the input data and can guarantee that the input text matches the given format. Seems like overkill to involve yet another 3rd party tool here.
David Lively
It's not overkill, it's reliability, and it doesn't matter if you parse all if you parse any. Try it, it'll help solve many problems, not just this one.
annakata
While I appreciate the utterly stable approach, this particular solution is a) working, b) a temporary solution that allows me to present a LOT of legacy ASP content in a new ASP.NET framework, and c) working. As I said, I have control over the input data and can guarantee that my regex works. If I need a more general solution in the future, I'll happily explore the agility pack. Thanks. =)
David Lively
Okay, I take it back. The HtmlAgilityPack is sweet.
David Lively