I have this C# code that pulls links from a web page, and I want to make it smarter so that I can easily add small changes in the future to exclude links based on two criteria.

First, I want to exclude links to certain file extensions found on pages, such as PDF or PPT files...

Next, I want to exclude links based on the first part of the URL, such as ftp, images.google..., maps.google..., and mailto...

This is my current code, which needs help:

MatchCollection m1 = Regex.Matches(file, @"(?i)(<A[^>]*href\s*=\s*['""](?!mailto|[^'""]*\.(?:pdf|doc|ppt))[^>]*>.*?</A>)", RegexOptions.Singleline);
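One way to keep the two kinds of exclusion easy to extend is to build the pattern from two lists instead of hard-coding the alternatives. The following is only a sketch under stated assumptions: the class name `LinkFilter`, the list contents, and the `http://` scheme on the Google prefixes are illustrative guesses, not part of the original code. It also treats an extension as excluded only when it ends the URL (optionally followed by a query string or fragment), which is slightly stricter than the original lookahead.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class LinkFilter
{
    // Hypothetical exclusion lists; add entries here as new cases come up.
    static readonly string[] ExcludedExtensions = { "pdf", "doc", "ppt" };
    static readonly string[] ExcludedPrefixes =
    {
        // The http:// scheme on the Google hosts is an assumption.
        "mailto:", "ftp://", "http://images.google.", "http://maps.google."
    };

    // Builds the anchor-matching regex from the two lists above.
    public static Regex BuildPattern()
    {
        string prefixes = string.Join("|", ExcludedPrefixes.Select(Regex.Escape));
        string extensions = string.Join("|", ExcludedExtensions.Select(Regex.Escape));

        string pattern =
            // Opening tag up to the quote that starts the href value (group 1).
            @"<a\b[^>]*?\bhref\s*=\s*(['""])" +
            // Reject URLs that start with an excluded prefix.
            "(?!(?:" + prefixes + "))" +
            // Reject URLs that end in an excluded extension (query/fragment allowed).
            @"(?![^'""]*\.(?:" + extensions + @")(?:[?#][^'""]*)?\1)" +
            // Rest of the href value, the rest of the tag, and the link body.
            @"[^'""]*\1[^>]*>.*?</a>";

        return new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }
}

static class Program
{
    static void Main()
    {
        // Sample input, invented for illustration.
        string file =
            "<a href=\"http://example.com/a.html\">ok</a> " +
            "<a href=\"mailto:me@example.com\">mail</a> " +
            "<a href=\"/docs/report.pdf\">pdf</a>";

        foreach (Match m in LinkFilter.BuildPattern().Matches(file))
            Console.WriteLine(m.Value);
    }
}
```

With this sample input, the `mailto:` and `.pdf` anchors are expected to be skipped. Adding a new exclusion then means adding one string to a list rather than editing the regex itself.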
A:

Have you considered the Html Agility Pack?
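As a rough sketch of what that could look like, the code below parses the page with the Html Agility Pack and then filters each `href` with plain .NET types. It assumes the `HtmlAgilityPack` NuGet package is referenced; the class name `AgilityLinkFilter`, the `GetLinks` helper, the `baseUri` parameter, and the contents of the exclusion lists are hypothetical choices made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using HtmlAgilityPack; // NuGet package, not part of the .NET base class library

static class AgilityLinkFilter
{
    // Hypothetical exclusion lists; extend them as needed.
    static readonly HashSet<string> ExcludedExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { ".pdf", ".doc", ".ppt" };
    static readonly HashSet<string> ExcludedSchemes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "mailto", "ftp" };
    static readonly string[] ExcludedHostPrefixes = { "images.google.", "maps.google." };

    // Returns the absolute URLs of all links that survive the exclusions.
    // baseUri is the address of the page, used to resolve relative links.
    public static List<string> GetLinks(string html, Uri baseUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<string>();
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return links;

        foreach (HtmlNode a in anchors)
        {
            string href = a.GetAttributeValue("href", "");
            Uri uri;
            if (!Uri.TryCreate(baseUri, href, out uri)) continue;

            // Criterion 2: first part of the URL (scheme or host).
            if (ExcludedSchemes.Contains(uri.Scheme)) continue;
            if (ExcludedHostPrefixes.Any(p =>
                    uri.Host.StartsWith(p, StringComparison.OrdinalIgnoreCase))) continue;

            // Criterion 1: file extension of the path, ignoring any query string.
            if (ExcludedExtensions.Contains(Path.GetExtension(uri.AbsolutePath))) continue;

            links.Add(uri.AbsoluteUri);
        }
        return links;
    }
}
```

Because the filtering happens on a parsed `Uri` rather than inside one large regex, each exclusion rule is a separate line that can be changed without touching the others.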

Mitch Wheat
Interesting, I am checking this out. Thanks.
kacalapy