ansaurus

Question

Answer 1

+1 A:

Parsing HTML can be tricky because HTML often contains poorly formed tags and attributes. I suggest using an existing HTML parsing library to do the heavy lifting, or using XSLT to transform valid (X)HTML into your desired output.

This question http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c has some good links to HTML parsing libraries for C#.

jscharf 2010-08-22 05:39:35

An HTML parsing library is like taking a cannon to a duck hunt in this case.

jgauffin 2010-08-22 06:20:00

@jgauffin, I don't see how. It's definitely an appropriate solution.

strager 2010-08-22 08:28:46

Because the URIs are quite easy to find and replace in this case.

jgauffin 2010-08-22 09:55:16

Answer 2

+1 A:

What's the most efficient way to do this? in C#

1. Look for the string http://www.mysite.com.
2. If it doesn't exist, go to 7.
3. Look for the next ".
4. If it doesn't exist, error.
5. Insert ?ref=self before the ".
6. Go to 1.
7. Return.
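
As a rough sketch, the steps above could be written in C# with plain `IndexOf` calls. The class name `RefTagger`, the method name `AppendRef` and its default parameters are made up for illustration, and the code assumes every URL is terminated by a double quote and has no existing query string.

```csharp
using System;
using System.Text;

static class RefTagger
{
    // Hypothetical helper: appends `suffix` to every URL that starts with
    // `prefix` and ends at the next double quote.
    public static string AppendRef(
        string html,
        string prefix = "http://www.mysite.com",
        string suffix = "?ref=self")
    {
        var result = new StringBuilder(html.Length + 64);
        int pos = 0;
        while (true)
        {
            // Steps 1-2: find the next occurrence of the site prefix, or stop.
            int start = html.IndexOf(prefix, pos, StringComparison.Ordinal);
            if (start < 0) break;

            // Steps 3-4: find the closing quote, or report an error.
            int quote = html.IndexOf('"', start + prefix.Length);
            if (quote < 0)
                throw new FormatException("URL is not terminated by a double quote.");

            // Step 5: copy everything before the quote, then the suffix.
            result.Append(html, pos, quote - pos);
            result.Append(suffix);

            // Step 6: continue scanning from the closing quote.
            pos = quote;
        }

        // Step 7: copy the remainder and return.
        result.Append(html, pos, html.Length - pos);
        return result.ToString();
    }
}
```

Building the output in a `StringBuilder` avoids re-copying the whole document on every insertion, which matters if the page contains many links.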

This can be accomplished with the following regular expression substitution:

s#http://www\.mysite\.com[^"]*#&?ref=self#g
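
In C#, that substitution might look roughly like the following, using `System.Text.RegularExpressions`. The class and method names are invented for this example. In .NET replacement patterns `$0` stands for the whole match, playing the role of `&` in the sed-style expression.

```csharp
using System.Text.RegularExpressions;

static class RefTaggerRegex
{
    // Matches the site prefix followed by everything up to (not including)
    // the next double quote. The dots are escaped so they match literally.
    static readonly Regex SiteUrl =
        new Regex(@"http://www\.mysite\.com[^""]*", RegexOptions.Compiled);

    // Hypothetical helper: appends ?ref=self to every matched URL.
    public static string AppendRef(string html)
    {
        return SiteUrl.Replace(html, "$0?ref=self");
    }
}
```

Unlike the step list, this version does not raise an error when a closing quote is missing; the match simply runs to the end of the input.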

A nicer (more expressive) way would be to use an HTML parser and XPath.

strager 2010-08-22 05:43:22

Bug: The `href` attribute could be in single quotes ☺

Timwi 2010-08-22 05:46:46

@Timwi, That's not a bug. The OP clearly stated what the expected input is (which didn't include `'`), and that efficiency was a factor (so they say...).

strager 2010-08-22 05:58:26

@strager: I don’t see where he stated that. The OP clearly stated that the expected input is **HTML**. He neither stated that it is a special subset of HTML, nor did he state that his examples are exhaustive. If I hadn’t commented, he might not have realised that any of his HTML files could contain href attributes with single quotes and that your algorithm would silently skip them.

Timwi 2010-08-22 12:58:53

Answer 3

A:

You could use Page.Request.UrlReferrer to detect where the request came from.

bjhamltn 2010-08-22 08:55:12

Processing HTML document with C#
