Hi All,

I have just begun scraping basic text off web pages, and am currently using the HtmlAgilityPack C# library. I had some success with box scores from rivals.yahoo.com (sports is my thing, so why not scrape something interesting?), but I am stuck on the NHL's game summary pages. I think this is an interesting problem, so I thought I would post it here.

The page I am testing is: http://www.nhl.com/scores/htmlreports/20102011/GS020079.HTM

At first glance, it looks like basic text with no AJAX or anything else that would trip up a basic scraper. Then I realized I couldn't right-click because of some JavaScript, so I worked around that. Right-clicking in Firefox and getting the XPath of the home team with XPather gives me:

/html/body/table[@id='MainTable']/tbody/tr[1]/td/table[@id='StdHeader']/tbody/tr/td/table/tbody/tr/td[3]/table[@id='Home']/tbody/tr[3]/td

When I try to grab that node or its inner text, HtmlAgilityPack doesn't find it. Does anyone see anything strange in the page's source code that might be stopping me?

I am new to this and still learning how sites might stop me from scraping, so any tips or tricks are greatly appreciated!

P.S. I observe all site rules regarding bots, etc., but I noticed this strange behavior and saw it as a challenge.

+1  A: 

OK, so it appears that my XPaths have tbody steps in them. When I remove these tbody steps manually from the XPath, HtmlAgilityPack handles it fine.

I'd still like to know why I am getting invalid xpaths, but for now I have answered my question.
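
A likely explanation, as far as I can tell: when Firefox parses a table whose rows are not wrapped in a tbody, it adds an implied tbody element to its DOM, and XPather builds the path from that live DOM. HtmlAgilityPack parses the source as written and does not add the element, so the tbody steps match nothing. Below is a minimal sketch of automating the manual fix. XPathCleaner and StripTbody are hypothetical names of my own, not part of HtmlAgilityPack, and the approach assumes the page source really has no explicit tbody tags (if it does, stripping the steps would break the path instead).

```csharp
using System;
using System.Text.RegularExpressions;

static class XPathCleaner
{
    // Matches a "/tbody" step, with an optional predicate such as "[1]",
    // only when it is a whole step (followed by "/" or the end of the string).
    private static readonly Regex TbodyStep =
        new Regex(@"/tbody(\[[^\]]*\])?(?=/|$)");

    // Hypothetical helper: removes every tbody step from an XPath expression.
    // Assumption: the HTML source has no explicit <tbody> tags.
    public static string StripTbody(string xpath)
    {
        return TbodyStep.Replace(xpath, string.Empty);
    }
}

static class Program
{
    static void Main()
    {
        // The XPath reported by XPather in the question above.
        string fromBrowser =
            "/html/body/table[@id='MainTable']/tbody/tr[1]/td" +
            "/table[@id='StdHeader']/tbody/tr/td/table/tbody/tr/td[3]" +
            "/table[@id='Home']/tbody/tr[3]/td";

        // Prints the same path with the four tbody steps removed.
        Console.WriteLine(XPathCleaner.StripTbody(fromBrowser));
    }
}
```

The cleaned string can then be passed to SelectSingleNode as usual. Keep in mind that SelectSingleNode returns null when nothing matches, so a null check before reading InnerText is worthwhile.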

Saab
Probably related to either the browser or the XPather app. I'm going to check it out; sounds interesting.
Anonymous Type
A: 

Unless my XPath knowledge is heaps flawed (it probably is), I think the problem is the /tbody steps in your XPath expression.

When I do

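// Requires: using System.IO; plus a reference to HtmlAgilityPack.
string test = string.Empty;
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
// Load a local copy of the game summary page; "using" closes the reader.
using (StreamReader sr = new StreamReader(@"C:\gs.htm"))
{
    doc.Load(sr);
}
// Same target cell as in the question, but without any tbody steps.
string xpath = @"//table[@id='Home']/tr[3]/td";
HtmlAgilityPack.HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
// SelectSingleNode returns null when the XPath matches nothing.
if (node != null)
{
    test = node.InnerText;
}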

That works fine and returns
"COLUMBUS BLUE JACKETSGame 5 Home Game 3"
which I hope is the string you wanted.

Examining the HTML source, I couldn't find a tbody element anywhere.

Anonymous Type