I'm trying to write a regular expression pattern (in Python) for reformatting these template engine files.

Basically the scheme looks like this:

[$$price$$]
{
    <h3 class="price">
    $12.99
    </h3>
}

I'm trying to make it remove any extra tabs, spaces, and newlines, so it should look like this:

[$$price$$]{<h3 class="price">$12.99</h3>}

I wrote this: (\t|\s)+? which works, except that it also matches inside the HTML tags, so <h3 class="price"> becomes <h3class="price">. I can't figure out how to make it ignore anything inside the tags.

+5  A: 

Using regular expressions to deal with HTML is extremely error-prone; they're simply not the right tool.

Instead, use an HTML/XML-aware library (such as lxml) to build a DOM-style object tree, modify the text segments within the tree in place, and generate your output again using that same library.
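
As a rough sketch of the "only touch the text segments" idea: the snippet below swaps in Python's standard-library html.parser for lxml, so it walks a stream of tags and text rather than building a real DOM tree. The names TextNodeCleaner and clean_template are made up for illustration, and the rule for what counts as "extra" whitespace (a newline plus any whitespace around it) is an assumption, since the template syntax was never formally defined. Comments and doctype declarations are simply dropped in this version, and end tags are written back in lowercase.

```python
import re
from html.parser import HTMLParser


class TextNodeCleaner(HTMLParser):
    """Re-emit tags as written; remove newline-anchored whitespace in text only."""

    def __init__(self):
        # convert_charrefs=False routes entities such as &amp; through
        # handle_entityref/handle_charref so they can be written back as-is.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the tag exactly as it appeared in the input.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags like <br/> are also copied through untouched.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Assumed rule: drop each newline together with the whitespace around it.
        self.parts.append(re.sub(r"\s*\n\s*", "", data))

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def clean_template(text):
    parser = TextNodeCleaner()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)


template = '[$$price$$]\n{\n    <h3 class="price">\n    $12.99\n    </h3>\n}\n'
print(clean_template(template))
```

Because whitespace inside a tag never reaches handle_data, the space in <h3 class="price"> is left alone.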

Charles Duffy
The question isn't really about HTML, it's about whitespace, and it's well within the capabilities of regexes.
Alan Moore
Alan - it's about doing whitespace removal *in a context-sensitive manner*; handling the general case calls for something with the expressiveness of a recursive descent parser.
Charles Duffy
A: 

Try this:

\r?\n[ \t]*

EDIT: The idea is to remove all newlines (either Unix: "\n", or Windows: "\r\n") plus any horizontal whitespace (TABs or spaces) that immediately follow them.
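
For what it's worth, here is how that pattern might be applied from Python with re.sub. The function name collapse and the sample string are just for illustration. Note that the pattern only removes whitespace that follows a newline, so trailing spaces before a line break would survive.

```python
import re

# A newline (Unix or Windows) plus any indentation that follows it.
NEWLINE_AND_INDENT = re.compile(r"\r?\n[ \t]*")


def collapse(text):
    """Remove every line break together with the indentation after it."""
    return NEWLINE_AND_INDENT.sub("", text)


template = '[$$price$$]\n{\n    <h3 class="price">\n    $12.99\n    </h3>\n}\n'
print(collapse(template))
```

The space inside the h3 tag is not preceded by a newline, so it is never matched.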

Alan Moore
That works for the example given -- but we haven't been given a formal definition for the template syntax, and so don't know if it works in the general case.
Charles Duffy
And we probably never will be given one; I've never seen any follow-up from anyone posting as "unknown (whatever)".
Alan Moore
A: 

Alan,

I have to agree with Charles that the safest way is to parse the HTML and then work on the text nodes only. It sounds like overkill, but it's the safest option.

On the other hand, there is a way in regex to do that as long as you trust that the HTML code is correct (i.e. does not include invalid < and > in the tags as in: <a title="<this is a test>" href="look here">...)

Then you know that any text has to sit between a > and a <, except at the very beginning and end of the input (that matters if you only have a fragment of the page; a full page is at minimum wrapped in the HTML tag).

So you still need two regexes: first find the text with '>[^<]+<', then apply the other regex, the one you mentioned, to each match.
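
A minimal sketch of that two-pass idea, assuming the second regex is the newline-plus-indentation pattern from the earlier answer, and assuming well-formed markup with no stray < or > inside attribute values. The function name clean_between_tags is invented for the example. As noted above, text before the first tag or after the last one is not covered.

```python
import re

# Pass 1: text that sits between the end of one tag and the start of the next.
TEXT_BETWEEN_TAGS = re.compile(r">[^<]+<")
# Pass 2: a line break plus the indentation after it.
NEWLINE_AND_INDENT = re.compile(r"\r?\n[ \t]*")


def clean_between_tags(html):
    """Apply the whitespace regex only to the text found between tags."""
    def fix(match):
        return NEWLINE_AND_INDENT.sub("", match.group(0))
    return TEXT_BETWEEN_TAGS.sub(fix, html)


fragment = '<div>\n    <h3 class="price">\n    $12.99\n    </h3>\n</div>'
print(clean_between_tags(fragment))
```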

The other way is to use an alternation, something like this (not tested!):

'(<[^>]*>)|([\r\n\f\t ]+)'

This matches either a tag or a run of whitespace. When the match is a tag, leave it as is; otherwise, replace it with an empty string.
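
Here is one way the alternation could be wired up in Python, using a replacement function with re.sub. The character class below lists \t explicitly so that tabs are matched as well, and the function name strip_outside_tags is hypothetical. Be aware that this removes all whitespace outside tags, including spaces between words, so it only suits templates like the one in the question.

```python
import re

# Group 1 captures a whole tag; group 2 captures a run of whitespace.
TAG_OR_SPACE = re.compile(r"(<[^>]*>)|([\r\n\f\t ]+)")


def strip_outside_tags(text):
    """Keep tags verbatim and delete whitespace everywhere else."""
    def replace(match):
        # A tag matched: put it back unchanged. Otherwise drop the whitespace.
        return match.group(1) or ""
    return TAG_OR_SPACE.sub(replace, text)


template = '[$$price$$]\n{\n\t<h3 class="price">\n\t$12.99\n\t</h3>\n}\n'
print(strip_outside_tags(template))
```

The tag alternative is tried at each < and consumes everything up to the closing >, which is why the space inside the tag is never seen by the whitespace alternative.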