I have HTML tables in a web page like this:

<table border=1>
    <tr><td>sno</td><td>sname</td></tr>
    <tr><td>111</td><td>abcde</td></tr>
    <tr><td>213</td><td>ejkll</td></tr>
</table>

<table border=1>
    <tr><td>adress</td><td>phoneno</td><td>note</td></tr>
    <tr><td>asdlkj</td><td>121510</td><td>none</td></tr>
    <tr><td>asdlkj</td><td>214545</td><td>none</td></tr>
</table>

From this page I want to use Html Agility Pack to extract only the data in the address and phone number columns. That means I first have to find which table contains the adress and phoneno columns, and then extract the data from those two columns. How should I do that?

I can get the table, but I don't understand what to do after that.

One more thing: is it feasible to extract data from a table by column name?

+3  A: 

Here are some helper methods that parse HTML tables into DataTable instances. You can iterate through the resulting DataTable array to find the one containing the columns you want. The code is coupled to the format of the tables in the HTML; in this case it obtains the column information from the first row (<tr>). Also note that no error checking is performed, so this will break with tables that do not follow the format you specified.

Helper methods:

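using System.Collections.Generic;
using System.Data;
using System.Linq;
using HtmlAgilityPack;

// Parses every <table> in the document, including nested ones, into a DataTable.
private static DataTable[] ParseAllTables(HtmlDocument doc)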
{
    var result = new List<DataTable>();
    foreach (var table in doc.DocumentNode.Descendants("table"))
    {
        result.Add(ParseTable(table));
    }
    return result.ToArray();
}

private static DataTable ParseTable(HtmlNode table)
{
    var result = new DataTable();

    var rows = table.Descendants("tr");

    var header = rows.Take(1).First();
    foreach (var column in header.Descendants("td"))
    {
        result.Columns.Add(new DataColumn(column.InnerText, typeof(string)));
    }

    foreach (var row in rows.Skip(1))
    {
        var data = new List<string>();
        foreach (var column in row.Descendants("td"))
        {
            data.Add(column.InnerText);
        }
        result.Rows.Add(data.ToArray());
    }
    return result;
}

Usage example:

public static void Main(string[] args)
{
    string html = @"
        <html><head></head>
        <body><div>
            <table border=1>
                <tr><td>sno</td><td>sname</td></tr>
                <tr><td>111</td><td>abcde</td></tr>
                <tr><td>213</td><td>ejkll</td></tr>
            </table>
            <table border=1>
                <tr><td>adress</td><td>phoneno</td><td>note</td></tr>
                <tr><td>asdlkj</td><td>121510</td><td>none</td></tr>
                <tr><td>asdlkj</td><td>214545</td><td>none</td></tr>
            </table>
        </div></body>
        </html>";

    HtmlDocument doc = new HtmlDocument();

    doc.LoadHtml(html);

    // Stays null if no table has both columns.
    DataTable addressAndPhones = null;
    foreach (var table in ParseAllTables(doc))
    {
        if (table.Columns.Contains("phoneno") && table.Columns.Contains("adress"))
        {
            // You found the address and phone number table
            addressAndPhones = table;
            break;
        }
    }
}
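
Once the matching table has been found, its cells can be read by column name through the DataRow indexer. The following is only a minimal sketch: the class name ColumnByNameSketch is made up, and the table is built by hand as a stand-in for what ParseTable would return for the second HTML table in the question.

```csharp
using System;
using System.Data;

public static class ColumnByNameSketch
{
    public static void Main()
    {
        // Hand-built stand-in (an assumption) for the parsed address/phone table.
        var table = new DataTable();
        table.Columns.Add("adress", typeof(string));
        table.Columns.Add("phoneno", typeof(string));
        table.Columns.Add("note", typeof(string));
        table.Rows.Add("asdlkj", "121510", "none");
        table.Rows.Add("asdlkj", "214545", "none");

        // Read only the two columns of interest, addressed by name.
        foreach (DataRow row in table.Rows)
        {
            string address = (string)row["adress"];
            string phone = (string)row["phoneno"];
            Console.WriteLine(address + " " + phone);
        }
    }
}
```

The column names have to match the header cells in the HTML, which is why the sketch uses the page's own spelling "adress".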
João Angelo
@Harikrishna, I updated the usage example.
João Angelo
@Harikrishna, `Skip` and `Take` are defined in `System.Linq`. You need to add a using statement for that namespace. LINQ is not available in .NET 2.0.
João Angelo
@Harikrishna, as I said, the helper functions are tightly coupled to a given HTML format. They work for the example given. If you have different inputs, you'll have to adapt them to your needs.
João Angelo
@Harikrishna, refer to point 3.10 of http://www.codeproject.com/KB/grid/practicalguidedatagrids2.aspx
João Angelo
@Harikrishna, yes.
João Angelo
@Joao Angelo, when a table has no closing tr tag ("/tr"), the information is not parsed correctly. What should I do in that case? For example, a row starts with an opening tr tag and the next row starts with a new opening tr tag, with no closing tag in between. Is there any option in Html Agility Pack to clean up the HTML page before parsing it?
Harikrishna
@Joao Angelo, please refer to this question of mine: http://stackoverflow.com/questions/2490765/which-is-the-best-html-tidy-pack-is-there-any-option-in-html-agility-pack-to-mak
Harikrishna
@Joao Angelo, is there any option in Html Agility Pack that tidies the web page before extracting information? Or should I use HTML Tidy by Dave Raggett to tidy the page first?
Harikrishna
@Joao Angelo, thanks for the help. For the missing closing tags I am using HTML Tidy for now, because I found no such option in Html Agility Pack.
Harikrishna
@Joao Angelo, I have one major problem that I have been trying to solve for many days. Sometimes the table in the HTML page does not start with the column header I want; it starts with other information that I want to skip, but then I get this error: **Sum of the columns' FillWeight values cannot exceed 65535.**
Harikrishna
@Joao Angelo, what if the table is nested, like `<table><tr><td></td><td><table><tr><td><td><table><tr><td><table>`? In that case I want to extract the innermost table.
Harikrishna
A: 

Loop through the table rows and get the column values by index:

int index = 0;
foreach(HtmlNode tablerow in table.SelectNodes("tr"))
{
    // skip the first row...
    if(index > 0)
    {
        // select first td element
        HtmlNode td1 = tablerow.SelectSingleNode("td[1]");
        if(td1 != null)
        {
            string address = td1.InnerText;
        }
    }
    index++;
}
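
The same loop can be driven by a header name instead of a hard-coded index: read the header texts from the first row, find the position of the wanted column, and use that position in the td[n] XPath. This is only a sketch under the assumption that the first row holds the headers, as in the question. The class name HeaderIndexSketch is made up, and null checks on SelectNodes are left out for brevity.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class HeaderIndexSketch
{
    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(@"<table border=1>
            <tr><td>adress</td><td>phoneno</td><td>note</td></tr>
            <tr><td>asdlkj</td><td>121510</td><td>none</td></tr>
            <tr><td>asdlkj</td><td>214545</td><td>none</td></tr>
        </table>");

        HtmlNode table = doc.DocumentNode.SelectSingleNode("//table");
        // SelectNodes returns null when nothing matches; not handled in this sketch.
        var rows = table.SelectNodes("tr").ToList();

        // Header texts from the first row, in document order.
        var headers = rows[0].SelectNodes("td")
                             .Select(td => td.InnerText.Trim())
                             .ToList();
        int phoneIndex = headers.IndexOf("phoneno");
        if (phoneIndex < 0)
        {
            return; // this table has no such column
        }

        foreach (HtmlNode row in rows.Skip(1))
        {
            // XPath positions are 1-based, hence the + 1.
            HtmlNode cell = row.SelectSingleNode("td[" + (phoneIndex + 1) + "]");
            if (cell != null)
            {
                Console.WriteLine(cell.InnerText);
            }
        }
    }
}
```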

If you can modify the webpage, you could use thead for header texts and tbody for actual values.

<table id="mytable">
    <thead><tr><td>Column1</td><td>Column2</td></tr></thead>
    <tbody>
        <tr><td>Value 1</td><td>Value 2</td></tr>
        <tr><td>Value 1</td><td>Value 2</td></tr>
    </tbody>
</table>

Then you wouldn't have to skip the first row.

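// "//table" searches the whole document; a single leading "/" would only match a table at the document root.
foreach(HtmlNode tablerow in table.SelectNodes("//table[@id=\"mytable\"]/tbody/tr"))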
{
    // ...
}

Have a look at an XPath tutorial; XPath is very useful with HtmlAgilityPack.

Mika Kolari
@Mika Kolari, thanks for the answer.
Harikrishna