So I'm writing an application that will do a little screen scraping. I'm using the HTML Agility Pack to load an entire HTML page into an instance of HtmlDocument called doc. Now I want to parse that doc, looking for this:

<table border="0" cellspacing="3">
<tr><td>First rows stuff</td></tr>
<tr>
<td> 
The data I want is in here <br /> 
and it's seperated by these annoying <br /> 's.

No id's, classes, or even a single <p> tag. </p> Just a bunch of <br />  tags.
</td> 
</tr> 
</table> 

So I just need to get the data within the 2nd row. How can I do this? Should I use a regex or something else?

Update: Here is how I'm loading my doc

HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load(Url);
A: 

You'd probably get better mileage with an xml parser.

Josh Sterling
A: 

"Something else" is the best answer -- HTML is best parsed by an HTML parser rather than via regular expressions. I'm no C# expert, but I hear the HTML Agility Pack is well-liked for this purpose.

Alex Martelli
I'm already using that. I updated my question to reflect that.
Bob Dylan
+1  A: 

I'd say som̡et̨hińg Else

Felipe Alsacreations
Normally I would agree with that too, but I think this is an exception because I'm looking for something so narrow. However, if you could **actually suggest something else** I would be open to that too.
Bob Dylan
Saw that coming..
BlueRaja - Danny Pflughoeft
+1  A: 

Since you are using Html Agility Pack already I would suggest using the methods it provides to find the information you want. There are a few ways to navigate the document, but one of the most concise is to use XPath. In this case you could use something like this:

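using System.Linq;      // provides the Single() extension method
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load("input.html");
// Select the cell in the second row of the table with cellspacing="3".
// Note: SelectNodes returns null when nothing matches.
HtmlNode node = doc.DocumentNode
                   .SelectNodes("//table[@cellspacing='3']/tr[2]/td")
                   .Single();
string text = node.InnerText;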
Mark Byers
I think you're on the right track, but I'm not seeing the `.Single()` method in IntelliSense. I'm using version 1.4.0 of the HTML Agility Pack.
Bob Dylan
Add a reference to System.Core and add `using System.Linq;`
alexn
@alexn: I did that and it's still not showing up.
Bob Dylan
@Bob Dylan: That code was just an example. You don't *have* to use `Single()` if you don't have it available - you could just write `.SelectNodes(...)[0]` instead. Though knowing about Linq would be a huge asset for developing in C#.
Mark Byers
@Mark: Ok I just tried using the `[0]` like you said and got an exception: `node`: "Object reference not set to an instance of an object". I assume this means it didn't find the table, tr, or the td?
Bob Dylan
@Bob Dylan: Correct. You could change the XPath expression to "//table[@cellspacing=3]" and see if that matches.
Mark Byers
@Mark: I tried that and it gave me the same error. Also, I've updated my question to show how I'm loading the document (just in case that makes a difference).
Bob Dylan
Ok. I got it working.
Bob Dylan
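
For anyone hitting the same "Object reference not set to an instance of an object" error discussed in the comments above: a minimal sketch of a null-safe lookup, assuming the document is loaded with `HtmlWeb` as in the question. The URL is a hypothetical placeholder, and `SelectSingleNode` is swapped in for `SelectNodes(...)[0]` because it returns null rather than a collection when nothing matches. This is not necessarily how the asker fixed it.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical URL (assumption); substitute the page being scraped.
        string url = "http://example.com/page.html";

        HtmlWeb hw = new HtmlWeb();
        HtmlDocument doc = hw.Load(url);

        // SelectSingleNode returns null when the XPath matches nothing,
        // so check the result before dereferencing it.
        HtmlNode cell = doc.DocumentNode
            .SelectSingleNode("//table[@cellspacing='3']/tr[2]/td");

        if (cell == null)
        {
            Console.WriteLine("No matching cell; compare the XPath with the actual markup.");
            return;
        }

        Console.WriteLine(cell.InnerText.Trim());
    }
}
```

If the check fails, loosening the expression one step at a time (first `//table[@cellspacing='3']`, then adding `/tr[2]`, then `/td`) shows which part of the path does not match the page.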
A: 

If you're using the Agility Pack already, then it's just a matter of using something like `doc.DocumentNode.SelectNodes("//table[@cellspacing='3']")` to get the table in the document. Try looking through the documentation and coding examples. Since you already have structured data, it's ridiculous to go back to the text data and reparse it.

Eclipse
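
Once the cell is selected, the `<br />`-separated pieces from the question can be collected by walking the cell's child nodes instead of reparsing text. A rough sketch, assuming the markup resembles the sample in the question; the inline HTML string and the `pieces` list are illustrative names, not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Illustrative markup modeled on the question's sample (assumption).
        string html =
            "<table border=\"0\" cellspacing=\"3\">" +
            "<tr><td>First rows stuff</td></tr>" +
            "<tr><td>The data I want is in here <br /> " +
            "and it's separated by br tags <br /> last piece</td></tr>" +
            "</table>";

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode cell = doc.DocumentNode
            .SelectSingleNode("//table[@cellspacing='3']/tr[2]/td");
        if (cell == null)
        {
            return; // nothing matched the XPath
        }

        // Each run of text between <br /> tags is its own text node;
        // the <br /> elements themselves are skipped.
        var pieces = new List<string>();
        foreach (HtmlNode child in cell.ChildNodes)
        {
            if (child.NodeType != HtmlNodeType.Text)
            {
                continue;
            }

            string text = HtmlEntity.DeEntitize(child.InnerText).Trim();
            if (text.Length > 0)
            {
                pieces.Add(text);
            }
        }

        foreach (string piece in pieces)
        {
            Console.WriteLine(piece);
        }
    }
}
```

This only looks at direct text children of the cell, so text nested inside other elements (such as a `<p>`) would need a recursive walk or a `.//text()` XPath query instead.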