I have a few hundred static HTML files that need to be processed.

They contain links like this:

<a href="http://www.mysite.com/">Link</a>

I need to append ?ref=self to any URL that begins with http://www.mysite.com, so that the link above becomes:

<a href="http://www.mysite.com/?ref=self">Link</a>

However, I do not know whether the URL will be written as http://www.mysite.com or http://www.mysite.com/, and it could also point to a subdirectory.

What's the most efficient way to do this in C#?

+1  A: 

Parsing HTML can be tricky, as HTML often contains poorly formed tags and attributes. I suggest looking into an existing HTML parsing library to do the heavy lifting, or using XSLT to transform valid (X)HTML into your desired output.

This question http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c has some good links to HTML parsing libraries for C#.

jscharf
An HTML parsing library is like taking a cannon to a duck hunt in this case.
jgauffin
@jgauffin, I don't see how. It's definitely an appropriate solution.
strager
Because the URIs are quite easy to find and replace in this case.
jgauffin
+1  A: 

What's the most efficient way to do this in C#?

  1. Look for the string http://www.mysite.com.
  2. If it doesn't exist, go to 7.
  3. Look for the next ".
  4. If it doesn't exist, error.
  5. Insert ?ref=self before the ".
  6. Go to 1.
  7. Return.
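
As a rough illustration, the steps above could be sketched in C# along these lines. The class name `RefTagger`, the method `AddRef`, and the `prefix` parameter are made up for this example; it assumes double-quoted attribute values and URLs that carry no query string yet.

```csharp
using System;
using System.Text;

static class RefTagger
{
    // Hypothetical helper: appends "?ref=self" to every double-quoted URL
    // that starts with the given prefix, following the numbered steps.
    public static string AddRef(string html, string prefix = "http://www.mysite.com")
    {
        var result = new StringBuilder(html.Length + 64);
        int pos = 0;
        while (true)
        {
            // Steps 1-2: find the next occurrence of the prefix, or finish.
            int start = html.IndexOf(prefix, pos, StringComparison.Ordinal);
            if (start < 0) break;

            // Steps 3-4: find the next double quote, or report an error.
            int quote = html.IndexOf('"', start);
            if (quote < 0)
                throw new FormatException("Unterminated attribute value.");

            // Step 5: copy everything before the quote, then add the suffix.
            result.Append(html, pos, quote - pos);
            result.Append("?ref=self");

            // Step 6: resume scanning at the quote, which is copied later.
            pos = quote;
        }

        // Step 7: copy the rest of the input and return.
        result.Append(html, pos, html.Length - pos);
        return result.ToString();
    }
}
```

Like the steps themselves, this sketch treats any occurrence of the prefix as the start of a URL, so a prefix that appears outside an attribute would also be extended up to the next double quote.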

This can be accomplished with the following regular expression substitution:

s#http://www\.mysite\.com[^"]*#&?ref=self#g
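
In C#, a substitution like the one above might be written with `Regex.Replace`. This is only a sketch: the class name `RefRegex` and the method `AddRef` are invented for illustration, the dots in the host name are escaped so they match literally, and it assumes double-quoted attribute values with no existing query string.

```csharp
using System.Text.RegularExpressions;

static class RefRegex
{
    // Matches the site prefix followed by everything up to the next double quote.
    private static readonly Regex SiteUrl =
        new Regex(@"http://www\.mysite\.com[^""]*", RegexOptions.Compiled);

    // Hypothetical helper: appends "?ref=self" to every matched URL.
    public static string AddRef(string html)
    {
        // In a .NET replacement pattern, "$&" stands for the whole match.
        return SiteUrl.Replace(html, "$&?ref=self");
    }
}
```

For the files themselves, something like `File.WriteAllText(path, RefRegex.AddRef(File.ReadAllText(path)))` per file should do, where `path` is whatever file is being processed.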

A nicer (more expressive) way would be to use an HTML parser and XPath.

strager
Bug: The `href` attribute could be in single quotes ☺
Timwi
@Timwi, That's not a bug. The OP clearly stated what the expected input is (which didn't include `'`), and that efficiency was a factor (so they say...).
strager
@strager: I don’t see where he stated that. The OP clearly stated that the expected input is **HTML**. He neither stated that it is a special subset of HTML, nor did he state that his examples are exhaustive. If I hadn’t commented, he might not have realised that any of his HTML files could contain href attributes with single quotes and that your algorithm would silently skip them.
Timwi
A: 

You could use Page.Request.UrlReferrer to detect where the request came from.

bjhamltn