I'm attempting to write a screen scraper for Digikey that will let our company keep accurate track of pricing, part availability, and product replacements when a part is discontinued. There seems to be a discrepancy between the XPath I see in Chrome DevTools and in Firebug on Firefox, and what my C# program sees.

The page that I'm scraping currently is http://search.digikey.com/scripts/DkSearch/dksus.dll?Detail&name=296-12602-1-ND

The code I'm currently using is pretty quick and dirty...

   //This function retrieves data from the digikey
   private static List<string> ExtractProductInfo(HtmlDocument doc)
   {
       List<HtmlNode> m_unparsedProductInfoNodes = new List<HtmlNode>();
       List<string> m_unparsedProductInfo = new List<string>();

       //Base Node for part info
       string m_baseNode = @"//html[1]/body[1]/div[2]";

       //Write part info to list
       m_unparsedProductInfoNodes.Add(doc.DocumentNode.SelectSingleNode(m_baseNode + @"/table[1]/tr[1]/td[1]/table[1]/tr[1]/td[1]"));
       //More lines of similar form will go here for more info
       //this retrieves digikey PN

       foreach(HtmlNode node in m_unparsedProductInfoNodes)
       {
           m_unparsedProductInfo.Add(node.InnerText);
       }

       return m_unparsedProductInfo;
   }

Although the path I'm using appears to be "correct", I keep getting NULL when I inspect the list "m_unparsedProductInfoNodes".

Any idea what's going on here? I'll also add that if I do a "SelectNodes" on the baseNode, it only returns a div whose only significant child is "cs=####", which seems to vary with the browser user agent. If I try to use this in any way (putting /cs=0 in the path for the unidentifiable browser), it pitches a fit, insisting that my expression doesn't evaluate to a node set. Leaving it out avoids the error, but then everything past div[2] is still returned as NULL.

A: 

Try using this XPath expression:

/html[1]/body[1]/div[2]/cs=0[1]/rf=141[1]/table[1]/tr[1]/td[1]/table[1]/tr[1]/td[1]

Using Google Chrome Developer Tools and Firebug in Firefox, it seems the page has 'cs' and 'rf' tags before the first table. Something like:

<cs="0">
  <rf="141">
    <table>
    ...
    </table>
  </rf>
</cs>

Here is something that helps you see what is happening when you parse a known HTML file and don't get the results you expect: search for a value you know is on the page and ask the matching node for its own path. In this case I just did:

string xpath = "";

//In this case I'll get all cells and see what cell has the text "296-12602-1-ND"

foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//td"))
{
    if (node.InnerText.Trim() == "296-12602-1-ND")
        xpath = node.XPath; //Here it is
}

Or you could just debug your application after the document loads and go through each child node until you find the one you want to get the info from. If you set a breakpoint where the InnerText is found, you can walk up through its parents and then keep looking for other nodes. I usually do that by manually entering commands in a 'watch' window and navigating with the tree view to see properties, attributes and children.

tsocks
I actually did this yesterday, and yes, the CS and RF tags show up when stepping through the XPath tree. But if you include them in any way, it complains that the expression "isn't a node set". If you ignore them, it no longer complains, but I get NULL. I'm baffled. I'm trying Python/Beautiful Soup this morning to see if it's just a bug in HTML Agility Pack or something. Also, Digikey has done a damn good job of scrubbing any useful information out of the table tags, reducing them to just the bare minimum <table>, with no ID or anything else I can see that would help tell them apart other than a direct path.
Matthias
A: 

Just for an update:

I switched from C# to the somewhat friendlier Python (my programming experience is asm, C, and Python; the whole OO thing was totally new to me) and managed to fix my XPath issues. The tag was indeed the problem, but luckily it's unique, so a little regular expression and a removed line and I was in good shape. I'm not sure why a tag like that breaks the XPath, though. If anyone has some insight I'd like to hear it.
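
A minimal sketch of that kind of cleanup, using only the Python standard library. The actual regular expression isn't shown in this thread, so the pattern, the names strip_stray_tags, CellCollector and extract_cells, and the sample markup are all assumptions for illustration. The idea is to delete the malformed wrapper tags before parsing, so the tables sit directly under the div again:

```python
import re
from html.parser import HTMLParser

# Assumed shape of the stray tags: <cs=0>, <rf=141> and the closers </cs>, </rf>.
STRAY_TAG = re.compile(r"</?(?:cs|rf)\b[^>]*>", re.IGNORECASE)


def strip_stray_tags(html):
    """Remove the malformed wrapper tags, keeping everything between them."""
    return STRAY_TAG.sub("", html)


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._depth = 0  # number of <td> elements currently open

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._depth += 1
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.cells[-1] += data


def extract_cells(html):
    """Strip the stray tags, then return the non-empty cell texts."""
    parser = CellCollector()
    parser.feed(strip_stray_tags(html))
    parser.close()
    return [text.strip() for text in parser.cells if text.strip()]


# Hypothetical markup, loosely modelled on the structure described above.
SAMPLE = """<html><body><div></div><div><cs=0><rf=141>
<table><tr><td>296-12602-1-ND</td><td>second cell</td></tr></table>
</rf></cs></div></body></html>"""

if __name__ == "__main__":
    print(extract_cells(SAMPLE))
```

The pattern is anchored on the tag names with a word boundary, so it only touches tags that start with cs or rf and leaves ordinary elements alone. If the real page uses other stray names, the alternation would need to be extended.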

Matthias