Hi everybody.
I have an HTML file (encoded in UTF-8) that I open with codecs.open(). The file structure is:
<html>
// header
<body>
// some text
<table>
// some rows with cells here
// some cells contain tables
</table>
// maybe some text here
<table>
// a form and other stuff
</table>
// probably some more text
</body></html>
I need to retrieve only the first table and discard the one with the form, omitting everything before the first <table> and after its matching </table>. Some cells also contain paragraphs, bold text, and scripts. There is no more than one nested table per row of the main table.
How can I extract it as a list of rows, where each row holds the plain (unicode string) data of its cells plus a list of rows for each nested table? There is no more than one level of nesting.
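To make the target concrete, here is a rough sketch of the kind of result I'm after. All cell values are invented, and the names rows, is_nested_table and count_nested_tables are hypothetical, only there for illustration. It also assumes one possible layout: a cell is either a plain string or, where the cell held a nested table, a list of that table's rows.

```python
# Hypothetical target structure: every cell value below is invented.
# A row is a list of cells. A cell is either a plain string or,
# where the cell held a nested table, a list of that table's rows.
rows = [
    ["Name", "Details"],
    ["Alice", [["phone", "555-0100"], ["city", "Oslo"]]],
    ["Bob", "no nested table in this row"],
]


def is_nested_table(cell):
    """Return True if the cell stands for a nested table (a list of rows)."""
    return isinstance(cell, list)


def count_nested_tables(table_rows):
    """Count the cells of the main table that hold a nested table."""
    return sum(is_nested_table(cell) for row in table_rows for cell in row)
```

With at most one level of nesting, a nested table's own rows would contain only plain strings.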
I tried HTMLParser, pyparsing, and the re module, but couldn't get any of them working. I'm quite new to Python, and by the way, this is my first post on StackOverflow.