+3  Q: 

HTML Agility Pack

I want to parse an HTML table using the HTML Agility Pack and extract data from only some predefined columns.

I am new to parsing and to the HTML Agility Pack. I have tried, but I can't work out how to use it for this.

If anybody knows how, please give me an example if possible.

EDIT:

Is it possible to parse an HTML table so that only the data under chosen column names is extracted? For example, if the table has 4 columns including name, address and phno, I want to extract only the name and address data.

+3  A: 

There is an example of that in the discussion forums here. Scroll down a bit to see the table answer. I do wish they would provide better samples that were easier to find.

EDIT: To extract data from specific columns you would have to first find the <th> tags that correspond to the columns you want and remember their indexes. You would then need to find the <td> tags for the same indexes. Assuming you know the indexes of the columns you could do something like this:

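// Requires: using HtmlAgilityPack;
// Load the page over HTTP; LoadHtml() expects markup, not a URL.
HtmlDocument doc = new HtmlWeb().Load("http://somewhere.com");
HtmlNode table = doc.DocumentNode.SelectSingleNode("//table");
// ".//tr" keeps the search inside this table; "//tr" would scan the whole document.
foreach (HtmlNode row in table.SelectNodes(".//tr"))
{
    // XPath positions are 1-based: td[2] is the second cell in the row.
    HtmlNode addressNode = row.SelectSingleNode("td[2]");
    HtmlNode phoneNode = row.SelectSingleNode("td[5]");
    if (addressNode == null || phoneNode == null)
    {
        continue; // header rows contain <th> cells, not <td>
    }
    // do something with addressNode.InnerText and phoneNode.InnerText here
}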
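
The header lookup described above can be isolated from the HTML parsing. The following is a minimal sketch, not part of the original answer: ColumnLocator and FindPositions are hypothetical names, and it assumes the header texts have already been read from the <th> cells in document order. It returns 1-based positions because XPath treats td[1] as the first cell.

```csharp
using System;
using System.Collections.Generic;

public static class ColumnLocator
{
    // Maps each wanted column name to its 1-based XPath position, given the
    // header texts in document order. Matching ignores case and surrounding
    // whitespace. Names that are not found are absent from the result.
    public static Dictionary<string, int> FindPositions(
        IList<string> headerTexts, IEnumerable<string> wantedNames)
    {
        var positions = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (string name in wantedNames)
        {
            for (int i = 0; i < headerTexts.Count; i++)
            {
                if (string.Equals(headerTexts[i].Trim(), name,
                                  StringComparison.OrdinalIgnoreCase))
                {
                    positions[name] = i + 1; // XPath positions start at 1
                    break;                   // keep the first matching header
                }
            }
        }
        return positions;
    }
}
```

With headers name, address, phno and wanted names name and address, the result maps name to 1 and address to 2, so the cells can be selected with td[1] and td[2].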

Edit2: If you don't know the indexes of the columns you could do the whole thing like this. I have not tested this.

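// Requires: using HtmlAgilityPack;
// Load the page over HTTP; LoadHtml() expects markup, not a URL.
HtmlDocument doc = new HtmlWeb().Load("http://somewhere.com");
var tables = doc.DocumentNode.SelectNodes("//table");

if (tables != null) // SelectNodes returns null when nothing matches
{
    foreach (var table in tables)
    {
        int addressIndex = -1;
        int phoneIndex = -1;
        // ".//th" searches inside this table only; "//th" would search the whole document.
        var headers = table.SelectNodes(".//th");
        if (headers == null)
        {
            continue; // no header cells, so this is not the table we want
        }
        for (int headerIndex = 0; headerIndex < headers.Count; headerIndex++)
        {
            string headerText = headers[headerIndex].InnerText.Trim();
            if (headerText == "address")
            {
                addressIndex = headerIndex + 1; // XPath positions are 1-based
            }
            else if (headerText == "phone")
            {
                phoneIndex = headerIndex + 1;
            }
        }

        if (addressIndex != -1 && phoneIndex != -1)
        {
            foreach (var row in table.SelectNodes(".//tr"))
            {
                // Build the XPath from the variable; "td[addressIndex]" would be taken literally.
                HtmlNode addressNode = row.SelectSingleNode("td[" + addressIndex + "]");
                HtmlNode phoneNode = row.SelectSingleNode("td[" + phoneIndex + "]");
                if (addressNode == null || phoneNode == null)
                {
                    continue; // the header row has <th> cells, not <td>
                }
                // do something with addressNode.InnerText and phoneNode.InnerText here
            }
        }
    }
}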
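
To try the idea without a network call, here is a minimal self-contained sketch that parses an HTML string instead of a URL. It assumes the HtmlAgilityPack NuGet package is referenced, that the wanted headers read "address" and "phone", and that tables are not nested; TableExtractor and ExtractAddressAndPhone are hypothetical names made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class TableExtractor
{
    // Hypothetical helper: returns (address, phone) pairs from the first table
    // whose <th> cells include both "address" and "phone".
    public static List<KeyValuePair<string, string>> ExtractAddressAndPhone(string html)
    {
        var results = new List<KeyValuePair<string, string>>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // LoadHtml takes markup, not a URL

        var tables = doc.DocumentNode.SelectNodes("//table");
        if (tables == null) return results; // null means no match

        foreach (HtmlNode table in tables)
        {
            var headers = table.SelectNodes(".//th");
            if (headers == null) continue;

            int addressPos = -1, phonePos = -1;
            for (int i = 0; i < headers.Count; i++)
            {
                string text = HtmlEntity.DeEntitize(headers[i].InnerText).Trim();
                if (text.Equals("address", StringComparison.OrdinalIgnoreCase))
                    addressPos = i + 1; // XPath positions are 1-based
                else if (text.Equals("phone", StringComparison.OrdinalIgnoreCase))
                    phonePos = i + 1;
            }
            if (addressPos == -1 || phonePos == -1) continue; // not the table we want

            var rows = table.SelectNodes(".//tr");
            if (rows == null) continue;
            foreach (HtmlNode row in rows)
            {
                HtmlNode address = row.SelectSingleNode("td[" + addressPos + "]");
                HtmlNode phone = row.SelectSingleNode("td[" + phonePos + "]");
                if (address == null || phone == null) continue; // header row
                results.Add(new KeyValuePair<string, string>(
                    address.InnerText.Trim(), phone.InnerText.Trim()));
            }
            break; // only one table per page is expected to match
        }
        return results;
    }
}
```

Calling ExtractAddressAndPhone on a page that holds several tables skips every table lacking either header, so unrelated tables are ignored without knowing their position on the page.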
Mike Two
@Harikrishna - Is it the same kind of data in each table? Do you want to extract the same columns from all the tables? Do you only want to find one specific table? Help me out a bit here. I keep trying to answer and then you provide more information. Let's get all the information out there.
Mike Two
@Mike Two Sir.. Okay, sorry for that. A web page has more than one table tag, but I want to extract data from only the one table whose column names match the ones we defined, like address and phone no. The other table tags hold other information and are not useful.
Harikrishna
@Mike Two Sir.. There are many web pages with more than one table, and from every web page I want to extract data from only the one table that has the phone no and address columns.
Harikrishna
@Harikrishna - no worries, just trying to make the process more efficient. How do you know which table to look for? Is it only because the table has the columns you want? Do you know the position of the columns (like address is 3rd and phone is 7th) as well as the names? Or do you only know the names?
Mike Two
@Mike Two Sir.. Yes, I know the column names. They are predefined, and their data should be extracted from the table on every web page. I know the column names but not the positions of those columns.
Harikrishna