ansaurus

Question

skip over HTML tags in Regular Expression patterns

Answer 1

+5 A:

Using regular expressions to deal with HTML is extremely error-prone; they're simply not the right tool.

Instead, use an HTML/XML-aware library (such as lxml) to build a DOM-style object tree, modify the text segments within the tree in place, and generate your output again using the same library.
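A minimal sketch of the "only touch the text segments" idea follows. It is not lxml: to stay dependency-free it swaps in the standard library's html.parser and re-emits the markup as it streams past, rather than building a tree. The names TextNodeCleaner and clean_text_nodes are made up for illustration, and the whitespace rule applied to text is the \r?\n[ \t]* pattern suggested elsewhere in this thread. Note that text inside script and style elements is also reported as data, so a real implementation would probably want to skip those.

```python
import re
from html.parser import HTMLParser


class TextNodeCleaner(HTMLParser):
    """Re-emit markup unchanged; strip newline + indentation in text only."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as it appeared.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are also emitted verbatim.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # The only place where whitespace is actually removed.
        self.out.append(re.sub(r"\r?\n[ \t]*", "", data))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def clean_text_nodes(html):
    """Hypothetical helper: run the cleaner over a string of HTML."""
    parser = TextNodeCleaner()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Because the start tag is copied through verbatim, a line break inside an attribute value survives, while the same line break in the text between tags is removed.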

Charles Duffy 2009-04-09 01:33:31

The question isn't really about HTML, it's about whitespace, and it's well within the capabilities of regexes.

Alan Moore 2009-04-09 01:39:54

Alan - it's about doing whitespace removal *in a context-sensitive manner*; handling the general case calls for something with the expressiveness of a recursive descent parser.

Charles Duffy 2009-04-09 01:45:38

Answer 2

A:

Try this:

\r?\n[ \t]*

EDIT: The idea is to remove all newlines (either Unix: "\n", or Windows: "\r\n") plus any horizontal whitespace (TABs or spaces) that immediately follow them.
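For illustration, this is roughly how the pattern could be applied in Python. The helper name strip_line_breaks is made up, and the sketch assumes the whole input can be processed as one string with no tags that need protecting.

```python
import re

# Newline (Unix or Windows) plus any spaces/TABs that directly follow it.
NEWLINE_AND_INDENT = re.compile(r"\r?\n[ \t]*")


def strip_line_breaks(text):
    """Remove every line break together with the indentation after it."""
    return NEWLINE_AND_INDENT.sub("", text)
```

Whitespace that comes before a newline is left alone, so "one \ntwo" becomes "one two" while "one\n   two" becomes "onetwo".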

Alan Moore 2009-04-09 01:36:59

That works for the example given -- but we haven't been given a formal definition for the template syntax, and so don't know if it works in the general case.

Charles Duffy 2009-04-09 01:47:03

And we probably never will be given one; I've never seen any follow-up from anyone posting as "unknown (whatever)".

Alan Moore 2009-04-09 04:54:33

Answer 3

A:

Alan,

I have to agree with Charles that the safest way is to parse the HTML and then work on the text nodes only. It sounds like overkill, but it is the safest option.

On the other hand, there is a way in regex to do that as long as you trust that the HTML code is correct (i.e. does not include invalid < and > in the tags as in: <a title="<this is a test>" href="look here">...)

In that case, you know that any text has to sit between a > and a <, except at the very beginning and end of the input (that is, if you only have a fragment of the page; a full page is wrapped in at least the HTML tag).

So... you still need two regexes: find the text with '>[^<]+<', then apply the other regex, the one you mentioned, to each match.
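The two-regex idea can be sketched in Python as below. This assumes the second regex is the \r?\n[ \t]* pattern from the earlier answer, and the function name strip_between_tags is invented for the example. As noted, it only holds up if no < or > appears inside attribute values, and text before the first tag or after the last one is not touched.

```python
import re

# Step 1: a run of text sitting between the end of one tag and the start of the next.
TEXT_BETWEEN_TAGS = re.compile(r">[^<]+<")
# Step 2: the whitespace rule applied inside each run found by step 1.
LINE_BREAK = re.compile(r"\r?\n[ \t]*")


def strip_between_tags(html):
    """Apply LINE_BREAK only to text that lies between '>' and '<'."""
    return TEXT_BETWEEN_TAGS.sub(
        lambda m: LINE_BREAK.sub("", m.group(0)),
        html,
    )
```

Passing a function to sub() lets the second regex run on each match of the first one, so line breaks inside the tags themselves are never seen by it.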

The other way is to use an alternation (an "or") with something like this (not tested!):

'(<[^>]*>)|([\r\n\f ]+)'

This will match either a tag or a run of whitespace. When the match is a tag, put it back unchanged; when it is not a tag, replace it with an empty string.
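A rough Python version of that alternation, using a replacement function to decide per match. The function names are hypothetical. The pattern is used exactly as posted, so it removes every run of the listed whitespace characters outside tags, including single spaces between words; narrowing the second group is left to taste.

```python
import re

# Group 1 captures a whole tag; group 2 captures a run of whitespace.
TAG_OR_SPACE = re.compile(r"(<[^>]*>)|([\r\n\f ]+)")


def strip_space_outside_tags(html):
    """Keep tags as they are and delete whitespace matched outside them."""

    def keep_tag(match):
        tag = match.group(1)
        # A tag matched: return it unchanged. Otherwise whitespace matched: drop it.
        return tag if tag is not None else ""

    return TAG_OR_SPACE.sub(keep_tag, html)
```

Because the tag alternative consumes everything up to the closing >, whitespace inside a tag is never offered to the second alternative.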

2009-05-31 21:04:54
