I'm attempting to write a screen scraper for Digikey that will allow our company to keep accurate track of pricing, part availability and product replacements when a part is discontinued. There seems to be a discrepancy between the XPATH that I'm seeing in Chrome Devtools as well as Firebug on Firefox and what my C# program is seeing.
The page that I'm scraping currently is http://search.digikey.com/scripts/DkSearch/dksus.dll?Detail&name=296-12602-1-ND
The code I'm currently using is pretty quick and dirty...
//This function retrieves data from the digikey
private static List<string> ExtractProductInfo(HtmlDocument doc)
{
List<HtmlNode> m_unparsedProductInfoNodes = new List<HtmlNode>();
List<string> m_unparsedProductInfo = new List<string>();
//Base Node for part info
string m_baseNode = @"//html[1]/body[1]/div[2]";
//Write part info to list
m_unparsedProductInfoNodes.Add(doc.DocumentNode.SelectSingleNode(m_baseNode + @"/table[1]/tr[1]/td[1]/table[1]/tr[1]/td[1]"));
//More lines of similar form will go here for more info
//this retrieves digikey PN
foreach(HtmlNode node in m_unparsedProductInfoNodes)
{
m_unparsedProductInfo.Add(node.InnerText);
}
return m_unparsedProductInfo;
}
Although the path I'm using appears to be "correct" I keep getting NULL when I look at the list "m_unparsedProductInfoNodes"
Any idea what's going on here? I'll also add that if I do a "SelectNodes" on the baseNode it only returns a div with the only significant child being "cs=####" which seems to vary with browser user agents. If I try to use this in anyway (putting /cs=0 in the path for the unidentifiable browser) it pitches a fit insisting that my expression doesn't evaluate to a node set, but leaving them still leaves the problem of all data past div[2] is returned as NULL.