ansaurus

Question

Trying to remove hex codes from regular expression results

Answer 1

A:

First off, I don't see how the RE you posted would find .TB_Image#TB_window. You could do something like:

/^[#\.]([a-zA-Z0-9_\-]*)\s*{?\s*$/

This would match a # or . at the beginning of a line, followed by the class or ID name, optionally followed by a { and then the end of the line.

Note that this would NOT work for lines like .TB_Image { something: 0; } (all on one line) or div.mydivclass since the . is not at the beginning of the line.
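To check the behaviour described above, the pattern can be tried against a few lines. This is only a minimal Python sketch: the /.../ delimiters are dropped, the brace is escaped, and the name RE_LINE and the sample lines are made up for illustration.

```python
import re

# The pattern from above without the /.../ delimiters (brace escaped for clarity).
RE_LINE = re.compile(r'^[#.]([a-zA-Z0-9_\-]*)\s*\{?\s*$')

# Hypothetical sample lines, one per CSS source line.
lines = [
    '.TB_Image {',
    '#TB_window',
    '.TB_Image { something: 0; }',
    'div.mydivclass {',
    '.foo, .bar {',
]

results = {}
for line in lines:
    m = RE_LINE.match(line)
    # Group 1 is the class or ID name; None means the line was not matched.
    results[line] = m.group(1) if m else None
    print(repr(line), '->', results[line])
```

Only the first two lines match; the one-line rule, the tag-prefixed class and the comma-separated list are all skipped.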

Edit: I don't think nested braces are allowed in CSS, so if you read in all the data and get rid of newlines, you could do something like:

/([a-zA-Z0-9_\-]*([#\.][a-zA-Z0-9_\-]+)+\s*,?\s*)+{.*}/

There's a way to tell a regex to ignore newlines as well, but I never seem to get that right.
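In Python, for example, the flag in question is re.DOTALL (the /s modifier in Perl-style syntax), which lets . match newlines as well. A minimal sketch, assuming a made-up two-line rule:

```python
import re

# Hypothetical rule whose braces sit on different lines.
css = ".TB_Image {\n  color: red;\n}"

# Without DOTALL, '.' stops at a newline, so the braces never pair up.
without_flag = re.search(r'\{.*\}', css)

# With DOTALL, '.' also matches '\n' and the whole body is found.
with_flag = re.search(r'\{.*\}', css, re.DOTALL)

print(without_flag)
print(repr(with_flag.group(0)))
```

With the flag set there is no need to strip newlines from the data first.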

Graeme Perrow 2010-02-04 00:08:27

And it doesn't work for `.foo, .bar`.

Felix Kling 2010-02-04 00:12:21

Answer 2

+1 A:

What about this:

([#.]\S+\s*,?)+(?=\{)
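A quick way to see what this pattern picks up is to run it over a few lines. The following is a minimal Python sketch; the helper find_selectors and the sample lines are assumptions for illustration, not part of the answer.

```python
import re

# The pattern from above, unchanged.
RE_SELECTORS = re.compile(r'([#.]\S+\s*,?)+(?=\{)')

def find_selectors(line):
    """Return the matched selector text (whitespace stripped), or None."""
    m = RE_SELECTORS.search(line)
    return m.group(0).strip() if m else None

print(find_selectors('.TB_Image#TB_window {'))
print(find_selectors('.foo, .bar {'))
print(find_selectors('div.mydivclass {'))
```

The first two lines are matched in full. For the third, only .mydivclass is returned, because the match can only start at a # or a dot.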

Rubens Farias 2010-02-04 00:16:32

Answer 3

A:

It's actually not an easy task to solve with regular expressions, since there are a lot of possibilities. Consider:

descendant selectors like #someid ul img -- each part is a valid selector, and the parts are separated by spaces
selectors that don't start with . or # (i.e. HTML tag names) -- you have to provide a list of those in order to match them, since nothing else distinguishes them from property names
comments
more that I can't think of right now

I think you should instead consider some CSS parsing library suitable for your preferred language.

kemp 2010-02-04 00:28:37

Care to add a reason to downvote?

kemp 2010-02-04 09:51:26

Answer 4

+2 A:

CSS is a very simple, regular language, which means it can be completely parsed by regex. All there is to it are groups of selectors, each followed by a brace-enclosed group of rules separated by semicolons, where each rule is a property and a value separated by a colon.

Note that all regexes in this post should have the verbose and dotall flags set (/x and /s in some languages, re.VERBOSE and re.DOTALL in Python).

To get pairs of (selectors, rules):

\s*        # Match any initial space
([^{}]+?)  # Ungreedily match a string of characters that are not curly braces.
\s*        # Arbitrary spacing again.
\{         # Opening brace.
  \s*      # Arbitrary spacing again.
  (.*?)    # Ungreedily match anything any number of times.
  \s*      # Arbitrary spacing again.
\}         # Closing brace.
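As a minimal sketch of how this pattern might be used from Python (the name RE_SIMPLE_BLOCK and the sample string are illustrative assumptions), compile it with both flags and call findall:

```python
import re

# Same pattern as above; VERBOSE allows the layout and comments,
# DOTALL lets '.' span line breaks inside a block.
RE_SIMPLE_BLOCK = re.compile(r"""
    \s*        # Match any initial space
    ([^{}]+?)  # Selectors: anything that is not a curly brace
    \s*
    \{         # Opening brace
      \s*
      (.*?)    # Rules: anything, as little as possible
      \s*
    \}         # Closing brace
""", re.VERBOSE | re.DOTALL)

css = "span {display: block;}\n.nothing{}"  # hypothetical sample
pairs = RE_SIMPLE_BLOCK.findall(css)
print(pairs)
```

Each element of pairs is a (selectors, rules) tuple of raw strings, still to be split further.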

This will not work in the rare case of having a quoted curly bracket in an attribute selector (e.g. img[src~='{abc}']) or in a rule (e.g. background: url('images/ab{c}.jpg')). This can be fixed by complicating the regex some more:

\s*        # Match any initial space
((?:       # Start the selectors capture group.
  [^{}\"\']            # Any character other than braces or quotes.
  |                    # OR
  \"                   # An opening double quote.
    (?:[^\"\\]|\\.)*   # Either a non-quote, non-backslash character, or an escaped character.
  \"                   # And a closing double quote.
  |                    # OR
  \'(?:[^\'\\]|\\.)*\' # Same as above, but for single quotes.
)+?)       # Ungreedily match all that once or more.
\s*        # Arbitrary spacing again.
\{         # Opening brace.
  \s*      # Arbitrary spacing again.
  ((?:[^{}\"\']|\"(?:[^\"\\]|\\.)*\"|\'(?:[^\'\\]|\\.)*\')*?)
           # The above line is the same as the one in the selector capture group.
  \s*      # Arbitrary spacing again.
\}         # Closing brace.
# This will even correctly identify escaped quotes.

Woah, that's a handful. But if you approach it in a modular fashion, you'll notice it's not as complex as it seems at first glance.

Now, to split selectors and rules, we have to match strings of characters that are either non-delimiters (where a delimiter is a comma for selectors and a semicolon for rules) or quoted strings with anything inside. We'll use the same pattern we used above.

For selectors:

\s*        # Match any initial space
((?:       # Start the selectors capture group.
  [^,\"\']             # Any character other than commas or quotes.
  |                    # OR
  \"                   # An opening double quote.
    (?:[^\"\\]|\\.)*   # Either a non-quote, non-backslash character, or an escaped character.
  \"                   # And a closing double quote.
  |                    # OR
  \'(?:[^\'\\]|\\.)*\' # Same as above, but for single quotes.
)+?)       # Ungreedily match all that.
\s*        # Arbitrary spacing.
(?:,|$)      # Followed by a comma or the end of a string.
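To illustrate why the quoted-string alternatives matter here, this minimal Python sketch runs the selector pattern on a made-up selector list that has a comma inside an attribute value. The sample string is an assumption, and the pattern is condensed slightly from the listing above.

```python
import re

RE_SELECTOR = re.compile(r"""
    \s*                        # Match any initial space
    ((?:
      [^,\"\']                 # Any character other than commas or quotes
      | \"(?:[^\"\\]|\\.)*\"   # A double-quoted string, escapes allowed
      | \'(?:[^\'\\]|\\.)*\'   # A single-quoted string, escapes allowed
    )+?)
    \s*                        # Arbitrary spacing
    (?:,|$)                    # A comma or the end of the string
""", re.VERBOSE | re.DOTALL)

# Hypothetical selector list; note the comma inside the quoted value.
selectors = 'a[title="x,y"], p#abc , *'
parts = RE_SELECTOR.findall(selectors)
print(parts)
```

A plain split on commas would cut the first selector in two; here the quoted comma stays inside it.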

For rules:

\s*        # Match any initial space
((?:       # Start the rule capture group.
  [^;\"\']             # Any character other than semicolons or quotes.
  |                    # OR
  \"                   # An opening double quote.
    (?:[^\"\\]|\\.)*   # Either a non-quote, non-backslash character, or an escaped character.
  \"                   # And a closing double quote.
  |                    # OR
  \'(?:[^\'\\]|\\.)*\' # Same as above, but for single quotes.
)+?)       # Ungreedily match all that.
\s*        # Arbitrary spacing.
(?:;|$)      # Followed by a semicolon or the end of a string.

Finally, for each rule, we can split (once!) on a colon to get a property-value pair.
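The "once" matters because a value may itself contain colons. A minimal Python sketch, using a made-up rule with a URL in it:

```python
import re

# Hypothetical rule whose value contains a colon of its own.
rule = 'background: url(http://example.com/bg.png)'

# Splitting once keeps the value intact.
once = re.split(r'\s*:\s*', rule, maxsplit=1)

# Splitting on every colon breaks the URL apart.
every = re.split(r'\s*:\s*', rule)

print(once)
print(every)
```

Only the first result is a usable property-value pair.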

Putting that all together into a Python program (the regexes are the same as above, but non-verbose to save space):

import re

CSS_FILENAME = 'C:/Users/Max/frame.css'

RE_BLOCK = re.compile(r'\s*((?:[^{}"\'\\]|\"(?:[^"\\]|\\.)*"|\'(?:[^\'\\]|\\.)*\')+?)\s*\{\s*((?:[^{}"\'\\]|"(?:[^"\\]|\\.)*"|\'(?:[^\'\\]|\\.)*\')*?)\s*\}', re.DOTALL)
RE_SELECTOR = re.compile(r'\s*((?:[^,"\'\\]|\"(?:[^"\\]|\\.)*\"|\'(?:[^\'\\]|\\.)*\')+?)\s*(?:,|$)', re.DOTALL)
RE_RULE = re.compile(r'\s*((?:[^;"\'\\]|\"(?:[^"\\]|\\.)*\"|\'(?:[^\'\\]|\\.)*\')+?)\s*(?:;|$)', re.DOTALL)

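# Read the whole stylesheet; the with block closes the file afterwards.
with open(CSS_FILENAME) as f:
    css = f.read()

# For each block: the list of selectors, then [property, value] pairs.
print([(RE_SELECTOR.findall(i),
        [re.split(r'\s*:\s*', k, maxsplit=1)
         for k in RE_RULE.findall(j)])
       for i, j in RE_BLOCK.findall(css)])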

For this sample CSS:

body, p#abc, #cde, a img .fgh, * {
  font-size: normal; background-color: white !important;

  -webkit-box-shadow: none
}

#test[src~='{a\'bc}'], .tester {
  -webkit-transition: opacity 0.35s linear;
  background: white !important url("abc\"cd'{e}.jpg");
  border-radius: 20px;
  opacity: 0;
  -webkit-box-shadow: rgba(0, 0, 0, 0.6) 0px 0px 18px;
}

span {display: block;} .nothing{}

... we get (spaced for clarity):

[(['body',
   'p#abc',
   '#cde',
   'a img .fgh',
   '*'],
  [['font-size', 'normal'],
   ['background-color', 'white !important'],
   ['-webkit-box-shadow', 'none']]),
 (["#test[src~='{a\\'bc}']",
   '.tester'],
  [['-webkit-transition', 'opacity 0.35s linear'],
   ['background', 'white !important url("abc\\"cd\'{e}.jpg")'],
   ['border-radius', '20px'],
   ['opacity', '0'],
   ['-webkit-box-shadow', 'rgba(0, 0, 0, 0.6) 0px 0px 18px']]),
 (['span'],
  [['display', 'block']]),
 (['.nothing'],
  [])]

Simple exercise for the reader: write a regex to remove CSS comments (/* ... */).
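One possible answer, as a rough Python sketch rather than a complete solution: a non-greedy match between the comment delimiters, with DOTALL so that comments can span lines. The names RE_COMMENT and strip_comments and the sample CSS are assumptions, and the pattern would also strip a /* ... */ sequence that appears inside a quoted string.

```python
import re

# Non-greedy so that each comment ends at the first closing delimiter.
RE_COMMENT = re.compile(r'/\*.*?\*/', re.DOTALL)

def strip_comments(css):
    """Remove /* ... */ comments; does not look inside quoted strings."""
    return RE_COMMENT.sub('', css)

# Hypothetical sample with an inline and a multi-line comment.
css = "a { color: red; /* warm */ }\n/* multi\n   line */\nb { color: blue; }"
cleaned = strip_comments(css)
print(cleaned)
```

Running this before the block regex keeps comment text from ending up in the selector or rule strings.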

Max Shawabkeh 2010-02-04 09:42:39
