Hi,

I have an HTML file that isn't syntactically correct, and I'm parsing it with HTML Agility Pack (http://htmlagilitypack.codeplex.com).

But if I have a link like

<a href="http://google.com/!/!!!">Google</a>

it's a problem. Is there a way to detect broken links, so that when one is found (no page is available at that URL) the application stores it in a list and returns that list?

The same problem applies to tags, for example:

<img hhh="jjj"/>

Here the image tag is all wrong; it should go in the 'errors for repair' list too.

Thanks in advance. Jeff

+2  A: 

You need to loop through Document.DocumentNode.Descendants("a") and check whether the href attribute is bad.

Similarly, you can loop through Document.DocumentNode.Descendants("img") and check for src attributes.
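
A minimal sketch of the two loops described above, assuming the markup is already in a string. The class and method names (`RequiredAttributeCheck`, `FindMissing`) are made up for illustration and are not part of HTML Agility Pack. It only reports tags whose attribute is absent or empty; it does not judge whether the value points anywhere useful.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

static class RequiredAttributeCheck
{
    // Returns the outer HTML of every <tagName> element that lacks a
    // non-empty attributeName attribute.
    public static List<string> FindMissing(string html, string tagName, string attributeName)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var problems = new List<string>();
        foreach (HtmlNode node in doc.DocumentNode.Descendants(tagName))
        {
            // GetAttributeValue returns the supplied default ("") when the attribute is absent.
            string value = node.GetAttributeValue(attributeName, "");
            if (string.IsNullOrWhiteSpace(value))
                problems.Add(node.OuterHtml);
        }
        return problems;
    }
}
```

Calling `RequiredAttributeCheck.FindMissing(html, "img", "src")` and `RequiredAttributeCheck.FindMissing(html, "a", "href")` would cover the two cases mentioned here.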

EDIT:

To check for bad attributes, you can maintain a Dictionary<string, IEnumerable<string>> that maps tag names to valid attributes, then use LINQ to find attributes that are not in the list, like this:

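// allAttributes maps each tag name to its legal attribute names
from tag in Document.DocumentNode.Descendants()
// skip text and comment nodes, and tags that have no entry in the dictionary
where tag.NodeType == HtmlNodeType.Element && allAttributes.ContainsKey(tag.Name)
let legalAttributes = allAttributes[tag.Name]
from attribute in tag.Attributes
where !legalAttributes.Contains(attribute.Name, StringComparer.OrdinalIgnoreCase)
select new { Tag = tag.OuterHtml, Attribute = attribute.Name }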
SLaks
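
For anyone who wants to try the query above, here is one way it might be wired up. This is a sketch under assumptions: the whitelist below is a tiny hypothetical sample, not a complete list of legal HTML attributes, and the names `AttributeCheck` and `FindIllegal` are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class AttributeCheck
{
    // Hypothetical whitelist; a real one would be built from the HTML specification.
    static readonly Dictionary<string, IEnumerable<string>> allAttributes =
        new Dictionary<string, IEnumerable<string>>(StringComparer.OrdinalIgnoreCase)
        {
            { "a",   new[] { "href", "title", "target" } },
            { "img", new[] { "src", "alt", "width", "height" } },
        };

    // Returns one "tag: attribute" entry per attribute that is not whitelisted.
    public static List<string> FindIllegal(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var errors =
            from tag in doc.DocumentNode.Descendants()
            // Only elements we have rules for; text and comment nodes are skipped.
            where tag.NodeType == HtmlNodeType.Element && allAttributes.ContainsKey(tag.Name)
            let legalAttributes = allAttributes[tag.Name]
            from attribute in tag.Attributes
            where !legalAttributes.Contains(attribute.Name, StringComparer.OrdinalIgnoreCase)
            select tag.Name + ": " + attribute.Name;

        return errors.ToList();
    }
}
```

Tags that are missing from the dictionary are silently ignored in this version; reporting them as unknown tags instead would be an equally reasonable choice.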
I've done that :) but the question was how can I discover whether the links are bad or not...
Jeff Norman
You can use the `WebClient` class to request the URL and see if you get an exception.
SLaks
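
A rough sketch of that `WebClient` idea, assuming the href values have already been collected into a list of strings. `LinkChecker`, `IsReachable` and `FindBrokenLinks` are names invented here. Note that newer .NET versions flag `WebClient` as obsolete in favour of `HttpClient`, and that this downloads nothing but still makes one request per link, so it can be slow on large pages.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

static class LinkChecker
{
    // True when the URL is a well-formed absolute http(s) URL
    // and the request completes without a WebException.
    public static bool IsReachable(string url)
    {
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri) ||
            (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps))
            return false; // not a usable web address, no request is made

        try
        {
            using (var client = new WebClient())
            using (client.OpenRead(uri)) { } // open and immediately close the response stream
            return true;
        }
        catch (WebException)
        {
            // Raised for HTTP error statuses as well as DNS and connection failures.
            return false;
        }
    }

    // Collects every URL that fails the check above.
    public static List<string> FindBrokenLinks(IEnumerable<string> urls)
    {
        var broken = new List<string>();
        foreach (string url in urls)
            if (!IsReachable(url))
                broken.Add(url);
        return broken;
    }
}
```

Relative links would need to be resolved against the page's base URL before being passed in; as written they are simply reported as broken.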
Hmm, this is nice, but what about img tags, for example? Or body tags, etc.? Is there a general way to repair them?
Jeff Norman
I'm not sure what you mean.
SLaks
Why was this downvoted?
SLaks
This solves only a little part of his problem. What do you do about the missing "src" in the image tag or the "hhh" that doesn't belong in there?
Hinek
I mean, in my example with the 'img' tag, the 'hhh' attribute does not exist. How can I detect that this attribute is not valid for the img tag? Or for any tag...
Jeff Norman
You can get a list of valid attributes for each tag, then use LINQ to see if the tag has any bad attributes.
SLaks
Can you update your answer with an example of this? Thank you very much.
Jeff Norman
You can be fancier and have an inheritable AttributeValidator class that checks whether an attribute's value is valid (e.g., a URL or a number), store the validators in a `Dictionary<string, Dictionary<string, AttributeValidator>>`, and validate every attribute.
SLaks
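
One possible shape for that idea, as a sketch only. Apart from `AttributeValidator` and the nested dictionary type, every name here (`UrlValidator`, `NumberValidator`, `HtmlRules`, `Check`) is hypothetical, and the rules table is a small sample rather than a real list of HTML attributes.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Base validator: accepts any value.
class AttributeValidator
{
    public virtual bool IsValid(string value) { return true; }
}

// Accepts non-empty values that parse as a relative or absolute URI.
class UrlValidator : AttributeValidator
{
    public override bool IsValid(string value)
    {
        Uri ignored;
        return !string.IsNullOrWhiteSpace(value)
            && Uri.TryCreate(value, UriKind.RelativeOrAbsolute, out ignored);
    }
}

// Accepts values made up of digits only (a non-negative integer).
class NumberValidator : AttributeValidator
{
    public override bool IsValid(string value)
    {
        int ignored;
        return int.TryParse(value, NumberStyles.None, CultureInfo.InvariantCulture, out ignored);
    }
}

static class HtmlRules
{
    // tag name -> attribute name -> validator (sample entries only)
    static readonly Dictionary<string, Dictionary<string, AttributeValidator>> Rules =
        new Dictionary<string, Dictionary<string, AttributeValidator>>(StringComparer.OrdinalIgnoreCase)
        {
            { "a", new Dictionary<string, AttributeValidator>(StringComparer.OrdinalIgnoreCase)
                { { "href", new UrlValidator() } } },
            { "img", new Dictionary<string, AttributeValidator>(StringComparer.OrdinalIgnoreCase)
                { { "src", new UrlValidator() }, { "alt", new AttributeValidator() }, { "width", new NumberValidator() } } },
        };

    // Returns null when the attribute passes, otherwise a short description of the problem.
    public static string Check(string tag, string attribute, string value)
    {
        Dictionary<string, AttributeValidator> attributes;
        if (!Rules.TryGetValue(tag, out attributes)) return "unknown tag";

        AttributeValidator validator;
        if (!attributes.TryGetValue(attribute, out validator)) return "unknown attribute";

        return validator.IsValid(value) ? null : "invalid value";
    }
}
```

With these sample rules, `HtmlRules.Check("img", "hhh", "jjj")` reports an unknown attribute, which is the case from the question.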
That's what I was just thinking: the list of attributes should contain only valid entries.
Jeff Norman
Why was this downvoted?
SLaks
I don't know, I've upvoted :)
Jeff Norman