I am writing a program that will help me find out which sites my competitors are linking to.

To do that, the program will parse an HTML file and produce two lists: internal links and external links.

I will use the internal links to further crawl the website, and the external links are actually what I am looking for.

Using .NET Regex, how do I parse an HTML file and find (1) external links and (2) internal links?

Thanks in advance, Eytan Levit.

Edit: In response to the question: no, I am not bound to regex. I can use any other ideas.

+8  A: 

Don't use a regular expression for this.

Use something like the HTML Agility Pack, which is specifically designed for parsing HTML. (There's even an example on its CodePlex homepage that finds all links in a page.)
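To illustrate the general approach (let a real HTML parser find the anchors, then compare hosts to sort them), here is a minimal sketch. It is written in Python with only the standard library so it can be run as-is; it is not the HTML Agility Pack API, and the names LinkCollector and split_links are made up for this example. The same two steps should carry over to .NET with whatever parser you pick.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def split_links(html, page_url):
    """Return (internal, external) lists of absolute URLs found in html.

    A link counts as internal when its host matches the host of page_url.
    This is an assumption made for the sketch; a real crawler may want to
    treat subdomains such as www.example.com and example.com as one site.
    """
    collector = LinkCollector()
    collector.feed(html)
    collector.close()

    page_host = urlparse(page_url).netloc.lower()
    internal, external = [], []
    for href in collector.hrefs:
        # Resolve relative links such as "/about" against the page URL.
        absolute = urljoin(page_url, href)
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        if parsed.netloc.lower() == page_host:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external
```

Calling split_links(html, "http://example.com/dir/index.html") gives the internal list to feed back into the crawl and the external list, which is the part the question is actually after.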

LukeH
Is it efficient? (I mean, one day I do plan to turn my application into a real web spider.)
Eytan Levit
+1 for HTML Agility Pack, it's a very useful library for processing HTML. @Eytan Levit: yes, I use this library in a few projects with no problems.
TcKs
A: 

I have used Regex for HTML parsing. It is really fast, but there are now better options that will reduce the development cost.

Try Linq To Html, it's good. Beth has a great post about it that can be found here.

Usman Masood