Hi, I am having a problem parsing some HTML from which I would like to extract data:

<td id="Company" style="border-bottom-width: 0px; padding-left: 5px">
<strong>ABC</strong>
</td>

The data I need is of course only "ABC". I have tried the following pattern, but it does not work:

/<td id=\"Company\" style=\"border-bottom-width: 0px; padding-left: 5px\">
<strong>(.*)<\/strong>
<\/td>/i

Can anyone who is familiar with this help?

+1  A: 

You really should not use regular expressions to parse HTML. It always ends up in a convoluted, tangled mess.

Use a library which has the functionality of tidy, like Beautiful Soup, JTidy, or NekoHTML, and walk the DOM tree (or handle the SAX events) to get at the contents of the tags.

Regexes are, however, great for getting the nuggets out of the rocks once the HTML/XML parsing is done.
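
As a rough illustration of the event-based approach, here is a minimal sketch in Python. It uses the standard library's `html.parser` rather than any of the libraries named above, purely so that it runs without extra installs. The names `CompanyExtractor` and `extract_company` are made up for this example, and the `id="Company"` lookup assumes the markup shown in the question.

```python
from html.parser import HTMLParser


class CompanyExtractor(HTMLParser):
    """Collect the text inside <td id="Company">, ignoring nested tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while we are inside the target cell
        self.chunks = []    # text fragments found inside the cell

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Nested tag such as <strong>. Assumption: nested tags are
            # properly closed (an unclosed <br> would throw the count off).
            self.depth += 1
        elif tag == "td" and dict(attrs).get("id") == "Company":
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_company(html):
    """Return the text of the Company cell, or '' if it is not present."""
    parser = CompanyExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


sample = """<td id="Company" style="border-bottom-width: 0px; padding-left: 5px">
<strong>ABC</strong>
</td>"""

print(extract_company(sample))  # ABC
```

Because the parser only looks at the tag name and the `id` attribute, changes in line endings, indentation, or the `style` value do not affect the result.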

Peter Tillemans
Hi, but that's the only way I can do it. The others work; just this one won't show up.
webdev28
Check for differences in whitespace: CR-LF vs. just CR, extra spaces, spaces vs. tabs. XML is (mostly) whitespace-agnostic; regexes are not. Another point is that many regex implementations require you to specifically turn on "multiline" matching.
Peter Tillemans
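
If a regex really is the only option, the whitespace advice above could be sketched like this in Python. This is an assumption-laden example, not the asker's actual setup: the question's pattern uses `/.../i` delimiters from another regex flavor, and `COMPANY_RE` and `find_company` are names invented here. The pattern uses `\s*` wherever the HTML may contain line breaks or indentation, and `[^>]*` so that it does not depend on the exact `style` attribute.

```python
import re

# \s* absorbs any mix of CR, LF, spaces, and tabs between the tags.
# [^>]* skips whatever attributes follow id="Company".
# In Python, re.DOTALL is the flag that lets "." match newlines;
# re.MULTILINE only changes how ^ and $ behave.
COMPANY_RE = re.compile(
    r'<td\s+id="Company"[^>]*>\s*<strong>(.*?)</strong>\s*</td>',
    re.IGNORECASE | re.DOTALL,
)


def find_company(html):
    """Return the text inside <strong>, or None if the cell is not found."""
    match = COMPANY_RE.search(html)
    return match.group(1).strip() if match else None


unix_style = '<td id="Company" style="padding-left: 5px">\n<strong>ABC</strong>\n</td>'
windows_style = '<td id="Company" style="padding-left: 5px">\r\n\t<strong>ABC</strong>\r\n</td>'

print(find_company(unix_style))     # ABC
print(find_company(windows_style))  # ABC
```

This still breaks if the attribute order changes or another tag is nested in the cell, which is the kind of fragility the answer warns about.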