I am parsing tabular information from an HTML file with the Html Agility Pack.

For simple tables this already works.

The problem arises when the table I want to extract is the innermost one, or when I don't know how deeply it is nested. There can be any number of nested tables, and from them I want to extract the one whose column names are NAME and ADDRESS.

Example:

<table>
    <tr><td>PHONE NO.</td><td>OTHER INFO.</td></tr>
    <tr><td>
        <table>
            <tr><td>AMOUNT</td></tr>
            <tr><td>50000</td></tr>
            <tr><td>80000</td></tr>
        </table>
    </td></tr>
    <tr><td>
        <table>
            <tr><td>
                <table>
                    <tr><td>
                        <table>
                            <tr><td> NAME </td><td>ADDRESS</td></tr>
                            <tr><td> ABC  </td><td> kfks   </td></tr>
                            <tr><td> BCD  </td><td> fdsa   </td></tr>
                        </table>
                    </td></tr>
                </table>
            </td></tr>
        </table>
    </td></tr>
</table>

There are many tables, but I want to extract only the one with the column names NAME and ADDRESS. How should I do this?

+1  A: 

Load the document as an HtmlDocument. Then use an XPath query to find a table that contains no other tables and that has a td in its first row whose text is "NAME".

The XPath implementation is the standard .NET one from System.Xml.XPath, so any documentation about using XPath with XmlDocument will be applicable.

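// requires: using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load("file.html");
// Innermost table (no nested tables) whose first row has a cell reading "NAME"; el is null if nothing matches.
HtmlNode el = doc.DocumentNode.SelectSingleNode("//table[not(descendant::table) and tr[1]/td['NAME' = normalize-space()]]");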

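If the position of the "NAME" column were fixed (say, always second), you could use something like 'NAME' = normalize-space(tr[1]/td[2]). Note that XPath string comparison is case-sensitive, so the literal must match the header text exactly.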
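As a rough end-to-end sketch of the query described above (not the answerer's exact code), the following loads a small inline sample instead of file.html, selects the innermost table whose first row has a NAME cell, and reads its rows as lists of strings. The names `InnermostTableDemo` and `ReadRows` and the sample markup are made up for illustration, and the sketch assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class InnermostTableDemo
{
    // Innermost table (no nested tables) whose first row has a td reading "NAME".
    public const string NameTableXPath =
        "//table[not(descendant::table) and tr[1]/td[normalize-space()='NAME']]";

    // Returns each row of the matched table as a list of trimmed cell texts,
    // or an empty list when no table matches.
    public static List<List<string>> ReadRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode table = doc.DocumentNode.SelectSingleNode(NameTableXPath);
        if (table == null)
        {
            return new List<List<string>>();
        }

        // Elements("tr") / Elements("td") only look at direct children.
        return table.Elements("tr")
            .Select(tr => tr.Elements("td")
                .Select(td => HtmlEntity.DeEntitize(td.InnerText).Trim())
                .ToList())
            .ToList();
    }

    public static void Main()
    {
        // Hypothetical sample: an outer table wrapping two inner tables.
        const string html = @"
<table>
  <tr><td>PHONE NO.</td><td>OTHER INFO.</td></tr>
  <tr><td>
    <table>
      <tr><td>AMOUNT</td></tr>
      <tr><td>50000</td></tr>
    </table>
  </td></tr>
  <tr><td>
    <table>
      <tr><td> NAME </td><td>ADDRESS</td></tr>
      <tr><td> ABC </td><td> kfks </td></tr>
    </table>
  </td></tr>
</table>";

        foreach (List<string> row in ReadRows(html))
        {
            Console.WriteLine(string.Join(" | ", row));
        }
    }
}
```

Checking for null before touching the result matters here, because SelectSingleNode returns null rather than throwing when no table satisfies the predicate.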

To find a table based on several column names, without the innermost-table condition:

HtmlNode el = doc.DocumentNode.SelectSingleNode("//table[tr[1]/td['NAME' = normalize-space()] and tr[1]/td['ADDRESS' = normalize-space()]]");
Lachlan Roche
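When the set of required column names varies, the multi-column predicate can be generated rather than written by hand. The helper below is a hypothetical sketch (the names `HeaderXPath` and `ForHeaders` are not part of Html Agility Pack); it only builds the XPath string, and it assumes the header texts contain no single quote characters, since XPath 1.0 has no escape for them inside a single-quoted literal.

```csharp
using System;
using System.Linq;

public static class HeaderXPath
{
    // Builds an XPath matching any table whose first row contains a td for
    // every given header text (compared after normalize-space()).
    public static string ForHeaders(params string[] headers)
    {
        if (headers == null || headers.Length == 0)
        {
            throw new ArgumentException("At least one header is required.", nameof(headers));
        }
        if (headers.Any(h => h == null || h.Contains("'")))
        {
            throw new ArgumentException("Headers must be non-null and contain no single quotes.", nameof(headers));
        }

        // One predicate per header, joined with "and".
        var conditions = headers.Select(h => "tr[1]/td[normalize-space()='" + h + "']");
        return "//table[" + string.Join(" and ", conditions) + "]";
    }
}
```

The result can be passed straight to the query, for example `doc.DocumentNode.SelectSingleNode(HeaderXPath.ForHeaders("NAME", "ADDRESS"))`.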
@Lachlan Roche, what if the first td is something else and "NAME" is the second column? For example: `<tr><td>Serial No.</td><td> NAME </td><td>ADDRESS</td>` `<tr><td>554545455 </td><td> ABC </td><td> kfks </td>` `<tr><td>456656323 </td><td> BCD </td><td> fdsa </td>`
Harikrishna
@Lachlan Roche, the position of that column is not fixed.
Harikrishna
@Harikrishna Answer updated to not assume a fixed column.
Lachlan Roche
@Lachlan Roche, thank you very much.
Harikrishna
@Lachlan Roche, so will it return the table that has the column headers name and phone no, wherever it sits among the nested tables?
Harikrishna
@Harikrishna This returns the innermost table due to `not(descendant::table)`
Lachlan Roche
Looks nice but doesn't work if NAME is not the first column
Konstantin Spirin
@Konstantin `tr[1]/td` will match any td in the first tr
Lachlan Roche
I also thought so but I just tried and it breaks if you use ADDRESS column (which is second column).
Konstantin Spirin
Now I like the result :)
Konstantin Spirin
@Lachlan Roche, it works for the table posted in the question, but not for one document I have: no table is returned and el is null in that case.
Harikrishna
@Harikrishna Probably one of: that table does contain more tables; or the header row contains th instead of td; or there is a tbody or thead. For the last case, change `tr[1]/td` to `descendant::tr[1]/td`. For the other cases, the adjustment to the xpath expression should be clear.
Lachlan Roche
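Two of the adjustments mentioned in the comment above (th header cells and a thead/tbody wrapper) can be folded into one expression. This is only a sketch under those assumptions: `TolerantTableFinder` and `Find` are illustrative names, and the query still requires that the target table contain no nested tables.

```csharp
using HtmlAgilityPack;

public static class TolerantTableFinder
{
    // descendant::tr[1] reaches the first row even inside thead/tbody;
    // *[self::td or self::th] accepts either kind of header cell.
    public const string XPath =
        "//table[not(descendant::table) and " +
        "descendant::tr[1]/*[self::td or self::th][normalize-space()='NAME']]";

    // Returns the first matching table, or null when none matches.
    public static HtmlNode Find(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode.SelectSingleNode(XPath);
    }
}
```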
@Lachlan Roche, the table I want to extract may be nested inside other tables, or the HTML document may contain no nested tables at all, in which case the table I want is simply the first table.
Harikrishna
A: 
var table = doc.DocumentNode.SelectSingleNode("//table [not(descendant::table) and tr[1]/td[normalize-space()='ADDRESS'] ]");
Konstantin Spirin
@Konstantin, hello. What should I do if the position of the "NAME" column is not fixed?
Harikrishna
@Konstantin, so will it return the table that has the column headers name and phone no, wherever it sits among the tables nested under a particular table?
Harikrishna
@Konstantin this is slightly nicer: `td[normalize-space()='ADDRESS']`
Lachlan Roche
I've removed the check for NAME. Works for me now.
Konstantin Spirin
@Konstantin, it works for the table posted in the question, but not for one document I have: no table is returned and table is null in that case.
Harikrishna
Please post examples that do not work
Konstantin Spirin
@konstantin, ok...
Harikrishna