parsing html to get data | ansaurus

tags:

html-parsing

Q:

parsing html to get data

Hi, I am having a problem parsing some HTML from which I would like to extract data:

<td id="Company" style="border-bottom-width: 0px; padding-left: 5px">
<strong>ABC</strong>
</td>

The data I need is of course just "ABC". I have tried the following regex, but it does not match:

/<td id=\"Company\" style=\"border-bottom-width: 0px; padding-left: 5px\">
<strong>(.*)<\/strong>
<\/td>/i

Can anyone who is familiar with this help?

+1 A:

You really should not use regular expressions to parse HTML. It always ends up in a convoluted, tangled mess.

Use a library with tidy-like functionality, such as Beautiful Soup, JTidy, or NekoHTML, and walk the DOM tree (or handle the SAX events) to get at the contents of the tags.

Once the HTML/XML parsing is done, however, regexes are great for picking the nuggets out of the rocks.

Peter Tillemans 2010-09-06 16:33:26
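
As a rough illustration of the event-driven approach suggested in the answer, here is a minimal sketch in Python. It uses the standard library's `html.parser` as a stand-in for the libraries named above, and the names `CellTextExtractor` and `extract_cell_text` are made up for this example. It assumes the target cell is identified by its `id` attribute, as in the question.

```python
from html.parser import HTMLParser


class CellTextExtractor(HTMLParser):
    """Collect the text inside the first <td> whose id matches cell_id."""

    def __init__(self, cell_id):
        super().__init__()
        self.cell_id = cell_id
        self.depth = 0      # > 0 while we are inside the target cell
        self.done = False   # set once the target cell has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            # Track nested <td> elements so we stop at the right </td>.
            if tag == "td":
                self.depth += 1
        elif tag == "td" and dict(attrs).get("id") == self.cell_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "td":
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        # Text from any child element (e.g. <strong>) ends up here too.
        if self.depth:
            self.chunks.append(data)


def extract_cell_text(html, cell_id):
    """Return the stripped text content of the cell, or '' if not found."""
    parser = CellTextExtractor(cell_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()
```

Calling `extract_cell_text(page, "Company")` on the snippet from the question should give back the company name regardless of how the markup is indented or which attributes the `<td>` carries.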

Hi, but that's the only way I can do it. The other rules work; just this one won't show up.

webdev28 2010-09-07 04:09:04

Check for differences in whitespace: CR-LF vs. just CR, extra spaces, spaces vs. tabs. XML is (mostly) whitespace-agnostic; regexes are not. Another point is that many regex implementations require you to explicitly turn on "multiline" matching.

Peter Tillemans 2010-09-07 09:54:37
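
If a regex really is the only option, the whitespace advice above could look something like the sketch below. It is written in Python rather than the asker's environment, and `COMPANY_CELL` and `company_name` are hypothetical names. The pattern uses `\s*` wherever the original hard-coded line breaks, and `[^>]*` instead of the exact `style` attribute. Note that in Python the flag that lets `.` match across line breaks is `re.DOTALL`; other regex engines name their flags differently.

```python
import re

# Tolerate any whitespace (spaces, tabs, CR, LF) between the tags and
# any extra attributes on the <td>, instead of matching them literally.
COMPANY_CELL = re.compile(
    r'<td\s+id="Company"[^>]*>\s*<strong>(.*?)</strong>\s*</td>',
    re.IGNORECASE | re.DOTALL,
)


def company_name(html):
    """Return the text inside <strong>, or None if the cell is not found."""
    match = COMPANY_CELL.search(html)
    return match.group(1).strip() if match else None
```

This is still fragile: it breaks as soon as the attribute order or the nesting changes, which is the reason a parser is the safer choice.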
