I would like to load an HTML page from a dynamic URL and read it like an XML file, node by node (by HTML tags). Is this possible?

For example, the page contains this HTML:

            <table class="bidders" cellpadding="0" cellspacing="0"> 

            <tr class="bidRow4"> 
                <td>kucik (automata)</td> 
                <td class="right">9 374 Ft</td> 
                <td class="bidders_date">2010-06-10 18:19:52</td> 
            </tr> 

            <tr class="bidRow4"> 
                <td>macszaf (automata)</td> 
                <td class="right">9 373 Ft</td> 
                <td class="bidders_date">2010-06-10 18:19:52</td> 
            </tr> 

            <tr class="bidRow2"> 
                <td>kucik (automata)</td> 
                <td class="right">9 372 Ft</td> 
                <td class="bidders_date">2010-06-10 18:19:42</td> 
            </tr> 

            <tr class="bidRow2"> 
                <td>macszaf (automata)</td> 
                <td class="right">9 371 Ft</td> 
                <td class="bidders_date">2010-06-10 18:19:42</td> 
            </tr> 

            <tr class="bidRow0"> 
                <td>kucik (automata)</td> 
                <td class="right">9 370 Ft</td> 
                <td class="bidders_date">2010-06-10 18:19:32</td> 
            </tr> 

            <tr class="bidRow0"> 
                <td>macszaf (automata)</td> 
                <td class="right">9 369 Ft</td> 
                <td class="bidders_date">2010-06-10 18:19:32</td> 
            </tr> 

            <tr class="bidRow8"> 
                <td>kucik (automata)</td> 
                <td class="right">9 368 Ft</td> 
                <td class="bidders_date">2010-06-10 18:19:22</td> 
            </tr> 

            <tr class="bidRow8"> 
                <td>macszaf (automata)</td> 
                <td class="right">9 367 Ft</td> 
                <td class="bidders_date">2010-06-10 18:19:22</td> 
            </tr> 

            <tr class="bidRow6"> 
                <td>kucik (automata)</td> 
                <td class="right">9 366 Ft</td> 
                <td class="bidders_date">2010-06-10 18:19:12</td> 
            </tr> 

            <tr class="bidRow6"> 
                <td>macszaf (automata)</td> 
                <td class="right">9 365 Ft</td> 
                <td class="bidders_date">2010-06-10 18:19:12</td> 
            </tr> 

        </table> 

I want to parse this into a ListView (or a Grid) and create rows from the data it contains. Each tr is a separate row, and each td within a given tr is a column in that row.

It also needs to be as fast as possible, because it would refresh itself every 5 seconds.

Is there any library for this?

+5  A: 

I recommend HTML Agility Pack. It doesn't require valid HTML, and it builds an HtmlDocument similar to XmlDocument. You'll have to handle the GUI part yourself.
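
As a rough illustration of that suggestion, here is a minimal sketch that pulls the rows of the `bidders` table into string arrays. It assumes the HtmlAgilityPack NuGet package is referenced; the class name `BidTableParser` and the method `ParseRows` are hypothetical names chosen for this example, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

// Hypothetical helper for illustration; not part of HTML Agility Pack.
public static class BidTableParser
{
    // Returns one string array per <tr>, with one element per <td>.
    public static List<string[]> ParseRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string[]>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var rows = doc.DocumentNode.SelectNodes("//table[@class='bidders']//tr");
        if (rows == null) return result;

        foreach (HtmlNode row in rows)
        {
            var cells = row.SelectNodes("./td");
            if (cells == null) continue;

            var values = new string[cells.Count];
            for (int i = 0; i < cells.Count; i++)
            {
                // Decode entities such as &nbsp; and strip surrounding whitespace.
                values[i] = HtmlEntity.DeEntitize(cells[i].InnerText).Trim();
            }
            result.Add(values);
        }
        return result;
    }
}
```

In a WinForms application, each returned array could then be passed to `new ListViewItem(values)` and added to the ListView, which is the GUI part left to the caller.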

Matthew Flaschen
That library is awesome.
Dan Tao
A: 
Joel Coehoorn
Not quite. The XHTML strict standard defines additional requirements on things like what attributes are available for what tags, what tags can be placed where, etc. Unless the HTML document links to a schema and the XML parser actually uses that schema, the document only needs to be syntactically valid XML.
Sean Edwards
This page's syntax never changes; I only want to read its content. Maybe the best solution would be RegEx?
fonix232
@fonix232 - http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
Joel Coehoorn
@Joel Coehoorn: As I said, the syntax never changes, just the data. So this can be parsed with RegEx if I read the file into a string. Nothing is changed or added when the page is updated; only those fields.
fonix232
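
For markup as fixed as the asker describes, the regular-expression idea could be sketched roughly as follows. This is only an illustration under that assumption: it expects every row to hold exactly three td cells in the order shown in the question, and it silently skips any row that deviates, which is the fragility the linked answer warns about. The names `BidRegexParser`, `ParseRows`, and the group names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; it only works while the markup keeps this exact shape.
public static class BidRegexParser
{
    // Matches one <tr> containing exactly three <td> cells, in the order shown in the question.
    private static readonly Regex RowPattern = new Regex(
        @"<tr[^>]*>\s*<td[^>]*>(?<bidder>.*?)</td>\s*" +
        @"<td[^>]*>(?<amount>.*?)</td>\s*" +
        @"<td[^>]*>(?<date>.*?)</td>\s*</tr>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns one { bidder, amount, date } array per matched row.
    public static List<string[]> ParseRows(string html)
    {
        var result = new List<string[]>();
        foreach (Match m in RowPattern.Matches(html))
        {
            result.Add(new[]
            {
                m.Groups["bidder"].Value.Trim(),
                m.Groups["amount"].Value.Trim(),
                m.Groups["date"].Value.Trim()
            });
        }
        return result;
    }
}
```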
A: 

Why not just do string replacement to convert the HTML table into XML:

   <table class="bidders" cellpadding="0" cellspacing="0">

becomes:

   <?xml version="1.0" encoding="UTF-8"?>

and

  <tr class="bidRow4">

becomes

  <item>

and

 <td class="right">

becomes

 <field1>

etc

EDIT 1:

I think the DataSet class also has a ReadXml method, so you could then databind to that DataSet:

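    // Requires: using System.Data;
    DataSet ds = new DataSet();
    ds.ReadXml("foo.xml");     // infers tables and columns from the XML
    DataGrid.DataSource = ds;
    DataGrid.DataBind();       // exists on the ASP.NET Web Forms DataGrid, not on WinForms grids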

or something similar
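
A minimal sketch of the same idea that reads the converted XML from a string instead of a file. The element names `bids`, `item`, `bidder`, `amount`, and `date` are assumptions made for this example, as are the class `BidDataSetLoader` and the method `LoadItems`. It relies on `ReadXml` inferring one table per repeated element and one column per simple child element.

```csharp
using System;
using System.Data;
using System.IO;

// Hypothetical helper for illustration.
public static class BidDataSetLoader
{
    // Expects XML shaped like:
    // <bids><item><bidder>..</bidder><amount>..</amount><date>..</date></item>...</bids>
    public static DataTable LoadItems(string xml)
    {
        var ds = new DataSet();
        using (var reader = new StringReader(xml))
        {
            // With no schema supplied, ReadXml infers the "item" table and its columns.
            ds.ReadXml(reader);
        }
        return ds.Tables["item"];
    }
}
```

In a WinForms application, the returned table can be assigned directly to a DataGridView's `DataSource` property; no `DataBind()` call is involved there.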

Darknight
I don't want to convert, as even reading a simple XML document with XmlDocument takes a very long time.
fonix232
Sounds like you're trying to scrape data off a website; there is never going to be a fast way of doing that. You need to find another method of getting that data. What other access do you have to this data?
Darknight
Only this HTML page, as it is rendered by an unknown script, from an unknown database, in an unknown way. So no more access, until I can hack my way around this.
fonix232
One problem with the DataSet method: this file has child nodes, so it throws an exception and sadly can't descend into them.
fonix232
Sorry, I'm not sure what you mean. Have you actually tried this method?
Darknight
Yes, I did try it; DataGridView has no DataBind() method.
fonix232
What about a DataGrid? I'm 99.99999% sure it has a DataBind method!
Darknight
Maybe, but DataGrid has to be added to the Form manually (writing the code by hand), as it does not have an icon in my toolbox (maybe my VS2010 is a bit fruckled up).
fonix232
What?!!! That's no reason not to use a control :) Are you mad! :) I kid. Anyway, you can reset your toolbox as described here: http://stackoverflow.com/questions/1268298/how-to-rebuild-the-visual-studio-toolbox
Darknight
A: 

I normally use Fast XPath Reader in combination with LinqToXML for the job. It is rather old (2007), though.

I wasn't aware of the HTML Agility Pack, so I can't say how the two compare in performance or ease of use.

Christina Mayers