Hi, for the last hour I have been trying to figure this out myself, but I am not having any success and thought maybe you could help.

Basically, I have an HTML email document with a lot of style attributes for inline styling of elements, which look something like this:

<th rowspan="10" style="font-weight: normal; vertical-align: top; text-align: left;" width="87">

Now what I need to do is remove all the white space so that it becomes:

<th rowspan="10" style="font-weight:normal;vertical-align:top;text-align:left;" width="87">

Playing around in http://www.gskinner.com/RegExr/ I have found this search expression

/style="([\w ;:\-0-9]+)"/gi

that matches only the style attribute with its contents, but I can't figure out how to replace the white space only within the $1 capture group.

Ultimately I will run this as a project-wide find and replace in TextMate, in case that matters.

In case you haven't noticed, I am a complete newbie to RegEx, so please try to explain your solution so I can learn from it for future reference.

Many thanks for reading,

Jannis

+1  A: 

It's a really tough question. I couldn't find a single regexp that does this, but you can use a sequence of regexps:

  1. Break the lines so that style="blabla" appears on a separate line (mark the separated lines with special strings so you can rejoin them later).
  2. Do the manipulation on the style="blabla" lines.
  3. Rejoin the lines.
  4. Clean up the remaining special markers.

Here is how I did it with sed (hopefully the conversion to TextMate regexp style is easy):

sed -e 's/\(.*\)\(style="[^"]*"\)\(.*\)/AAA\1\nBBB\2\nCCC\3/g' test.txt | sed '/BBB/s/ //g' | sed -e :a -e '$!N;s/\nBBB//;ta' -e 'P;D' | sed -e :a -e '$!N;s/\nCCC//;ta' -e 'P;D' | sed -e 's/AAA//g'

Explanation:

sed -e 's/\(.*\)\(style="[^"]*"\)\(.*\)/AAA\1\nBBB\2\nCCC\3/g' test.txt

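This breaks each line containing style="..." into 3 lines and marks them with the special strings AAA, BBB and CCC. Because the leading .* is greedy, only the last style attribute on a line is split out this way. The file ends up looking like this: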

AAA line before style
BBB line with style=""
CCC line after style

Then we apply the next regexp:

sed '/BBB/s/ //g'

This removes every space on the lines marked with BBB (i.e. the style lines).

Then we rejoin:

sed -e :a -e '$!N;s/\nBBB//;ta' -e 'P;D'

This appends each line starting with BBB to the previous line and removes the BBB marker.

And then:

sed -e :a -e '$!N;s/\nCCC//;ta' -e 'P;D'

This appends each line starting with CCC to the previous line and removes the CCC marker.

Lastly:

sed -e 's/AAA//g'

This removes the special string AAA.

It's surely suboptimal, and it could be done with much less effort using methods other than regexps (there are even tools for auto-formatting source code). Anyway, this is all I could do in an hour. I'm sure a single regexp that does what you want exists; it's just difficult to find.
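
For comparison, the same split, edit and rejoin idea can be sketched in a scripting language without marker strings. This is only a rough Python illustration, not a replacement for the sed pipeline; the names STYLE_ATTR and strip_style_spaces are made up for the example, and it assumes style attributes are always written with double quotes.

```python
import re

# The capturing group makes re.split keep the matched attributes in its result.
STYLE_ATTR = re.compile(r'(style="[^"]*")')

def strip_style_spaces(html: str) -> str:
    """Remove every space inside each style="..." attribute (hypothetical helper)."""
    parts = STYLE_ATTR.split(html)
    # Odd indexes hold the captured style="..." pieces,
    # even indexes hold the untouched text between them.
    for i in range(1, len(parts), 2):
        parts[i] = parts[i].replace(" ", "")
    return "".join(parts)

sample = ('<th rowspan="10" style="font-weight: normal; vertical-align: top; '
          'text-align: left;" width="87">')
print(strip_style_spaces(sample))
# prints: <th rowspan="10" style="font-weight:normal;vertical-align:top;text-align:left;" width="87">
```

Unlike the greedy sed expression, this handles several style attributes on one line, but like it, it also strips the spaces inside shorthand values.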

AmirW
Nice! A lot of work for 'just' removing white space, but I sure appreciate the time you spent finding and documenting the solution. I was able to use this method to successfully remove the whitespace as I needed. Thanks.
Jannis
@Jannis: I guess repeatedly replacing `/(style="[^" ]*) /` with `\1` would have worked, too. Much simpler than this approach.
Tomalak
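
A rough sketch of what "repeatedly replacing" could look like outside an editor, in Python. The pattern is the one from the comment above; the function name remove_spaces_iteratively and the loop around it are assumptions for illustration only. Each pass removes at most one space per style attribute, so the replacement is repeated until nothing changes.

```python
import re

# Pattern from the comment: a style attribute read up to its first space.
FIRST_SPACE = re.compile(r'(style="[^" ]*) ')

def remove_spaces_iteratively(html: str) -> str:
    """Drop one space per style attribute per pass until none are left."""
    while True:
        html, count = FIRST_SPACE.subn(r"\1", html)
        if count == 0:  # no style attribute contains a space any more
            return html

print(remove_spaces_iteratively('<a style="x: 1; y: 2" href="u v">'))
# prints: <a style="x:1;y:2" href="u v">
```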
+1  A: 

Watch out for shorthand properties. For example, in

style="background: #fff; border: 1px solid #ccc"

...you can safely remove the first three spaces, but the last two, separating the components of the border: shorthand value, must remain. Just for fun, here's a regex that removes any whitespace that's adjacent to the property names and the : and ; delimiters, but not within property values:

((?:\sstyle="|(?!\A)\G))\s*+([a-z]++(?>-[a-z]+)*+)\s*+:\s*+([^;]+?)\s*+;

Replace with:

$1$2:$3;

Testing it in EditPad Pro, it converts this (353 characters):

<th rowspan="10" style="font-weight: normal; vertical-align: top; text-align: left;" width="87"><input title="Search" value="" size=57 style="background: #fff; border: 1px solid #ccc ; border-bottom-color: #999; border-right-color:#999;color: #000; font: 18px arial,sans-serif bold; height: 25px; margin: 0; padding: 5px 8px 0 6px; vertical-align: top">

...to this (330 characters):

<th rowspan="10" style="font-weight:normal;vertical-align:top;text-align:left;" width="87"><input title="Search" value="" size=57 style="background:#fff;border:1px solid #ccc;border-bottom-color:#999;border-right-color:#999;color:#000;font:18px arial,sans-serif bold;height:25px;margin:0;padding:5px 8px 0 6px;vertical-align:top">

But I'm not recommending that you use this, or any regex solution; I'm just curious as to whether it works in TextMate like it does in EditPad. (TextMate uses the Oniguruma regex engine, which supports all the necessary features, so it should work, but I'm not in a position to test it myself.)

What you really should use for this job is a dedicated CSS compressor/minimizer/minifier; there are lots of them out there.
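
If a small script is an option, the shorthand caveat can also be handled with a replacement callback instead of the \G regex above. The following Python sketch is a different technique from that regex, and the names STYLE_ATTR, AROUND_DELIM, tighten and minify_inline_styles are invented for the example. It assumes double-quoted style attributes and that ':' and ';' only ever appear as delimiters (so no quoted strings containing them with surrounding spaces).

```python
import re

STYLE_ATTR = re.compile(r'style="([^"]*)"')
# Optional whitespace on either side of a ':' or ';' delimiter.
AROUND_DELIM = re.compile(r'\s*([:;])\s*')

def tighten(match: re.Match) -> str:
    """Rewrite one style attribute, trimming whitespace only around delimiters."""
    css = AROUND_DELIM.sub(r"\1", match.group(1).strip())
    return f'style="{css}"'

def minify_inline_styles(html: str) -> str:
    # The callback runs once per style attribute; other text is left alone.
    return STYLE_ATTR.sub(tighten, html)

print(minify_inline_styles(
    '<input style="background: #fff; border: 1px solid #ccc ; color:#000">'))
# prints: <input style="background:#fff;border:1px solid #ccc;color:#000">
```

Spaces inside a value, such as those in 1px solid #ccc, are never next to a delimiter, so they survive.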

Alan Moore