I'm working on a tool that parses files for CSS style declarations. It uses a very complicated regular expression that, besides the expected performance issues and a few minor bugs that aren't affecting me for now, is doing everything I'd like it to do except for one thing.

I have it matching all combinations of element names, classes, sub-classes, pseudo-classes, etc. However, when a line contains more than one declaration, I can only get it to match once. As an example, here is the kind of thing that is tripping me up at the moment:

td.class1, td.class2, td.class3
{
    background-color: #FAFAFA;
    height: 10px;
}

I can write an expression that matches all three declarations at once, but because I am also capturing the information after them (the actual style info within the braces), the entire block is consumed by that single match, and the engine resumes at the character following the block it just processed.

Is there a way to accomplish this where each class will be a separate match and all will include the style info that follows as well? I know that I can modify my regex to match the whole line and then parse it for commas after I get my match, but I'd like to keep all my logic inside the expression itself if possible.

I can post the expression and/or the commented code I use to generate it if it's absolutely relevant to the answer, but the expression is huge/ugly (as all non-trivial regexes are) and the code is a bit lengthy.

+1  A: 

Depending on deep nuances of your regex engine, you may be able to do this by embedding capturing parens in lookaheads, i.e. something like:

\.(\w+)(?=.*?{([^}]*)})

I'd expect figuring out the meaning of the match groups to be quite an exercise.
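As a rough illustration of the idea, here is a minimal sketch in Python's re module (an assumption on my part; the thread is about another engine, and details vary between engines). The braces are escaped, and re.DOTALL is added so that '.' can cross the newline between the selector line and the opening brace, since '.' does not match a newline by default in most engines. The sample text is the one from the question.

```python
import re

css = """td.class1, td.class2, td.class3
{
    background-color: #FAFAFA;
    height: 10px;
}
"""

# Group 1: the class name, which is consumed by the match.
# Group 2: the rule body, captured inside the lookahead, so the match
# position stays right after the class name and the next class can match.
pattern = re.compile(r"\.(\w+)(?=.*?\{([^}]*)\})", re.DOTALL)

matches = [(m.group(1), m.group(2).strip()) for m in pattern.finditer(css)]

for name, body in matches:
    print(name, "->", body.replace("\n", " "))
```

Each class name comes back as its own match, and every match carries the same body in group 2. This sketch would also pick up any other ".word" that is followed later by a brace block, so treat it as a demonstration of the lookahead trick rather than a robust selector matcher.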

chaos
Match groups can be named, and that's exactly what I'm already doing. I can match and extract the class name and the class body; that's not an issue. My issue is that I was looking for a way to match multiple class names that share a common body. I probably won't be able to, so I'll just match the whole line and split at the commas.
Rich
Right, what I'm saying is that using a pattern like the one above should do that for you, *if* the capture inside a lookahead works. What the lookahead is doing is allowing you to scan forward to the class body and (if the capture works) extract it, *without* moving the actual current position of the regex forward, so it can go on continuing to match class names.
chaos
Your pattern didn't match exactly, but the idea of capturing inside the lookahead was what eventually worked. I didn't get it at first. +1
Rich
+2  A: 

You need a CSS parser, not a regex. You should probably read "Is there a CSS Parser for C#".

Chas. Owens
A: 

This is not a good problem for regexes.

On the other hand, you only need a couple of passes to write a basic CSS parser, surely.

CSS syntax is just [some stuff], [open curly bracket], [some other stuff], [close curly bracket] after all.

You find those two chunks of stuff, you split the first one on commas and the second one on semicolons and you're pretty much done.
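The two passes described above could be sketched roughly like this. It is written in Python purely for illustration; parse_css is a hypothetical helper name, and the sketch assumes comments have already been stripped and that there are no nested blocks (such as @media), and no commas or semicolons inside strings or url(...) values.

```python
import re

def parse_css(text):
    """Return a list of (selector, {property: value}) pairs.

    Hypothetical helper: handles only flat "selectors { declarations }"
    blocks, with no comments, @-rules, or nested braces.
    """
    rules = []
    # Pass 1: find each "[some stuff] { [some other stuff] }" chunk.
    for m in re.finditer(r"([^{}]+)\{([^{}]*)\}", text):
        # Split the first chunk on commas to get the individual selectors.
        selectors = [s.strip() for s in m.group(1).split(",") if s.strip()]
        # Pass 2: split the second chunk on semicolons, then each
        # declaration on its first colon.
        declarations = {}
        for decl in m.group(2).split(";"):
            if ":" in decl:
                prop, value = decl.split(":", 1)
                declarations[prop.strip()] = value.strip()
        # Every selector in the group shares the same declarations.
        for sel in selectors:
            rules.append((sel, dict(declarations)))
    return rules

sample = """td.class1, td.class2, td.class3
{
    background-color: #FAFAFA;
    height: 10px;
}
"""

for selector, decls in parse_css(sample):
    print(selector, decls)
```

Each selector ends up as its own entry with a copy of the shared declarations, which is the shape the question asks for, just without keeping all the logic inside one expression.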

AmbroseChapel
+2  A: 

Here's a regex that works with your sample data:

@"([^,{}\s]+(?:\s+[^,{}\s]+)*)(?=[^{}]*(\{[^{}]+\}))"

The first part matches and captures a selector (td.class1) in group #1, then the lookahead skips over any remaining selectors and captures the associated style rules in group #2. The next match attempt starts where the lookahead started the previous time, so it matches the next selector (td.class2) and the lookahead captures the same block of rules again.
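To see the matching behaviour described above, here is a small sketch that runs the same pattern through Python's re module. That choice is an assumption made for illustration: the original is written as a C# verbatim string, where something like Regex.Matches would play the role of finditer. The variable names are made up for the example.

```python
import re

css = """td.class1, td.class2, td.class3
{
    background-color: #FAFAFA;
    height: 10px;
}
"""

# Same pattern as above, written as a Python raw string.
# Group 1: one selector (consumed).
# Group 2: the brace-delimited rule block, captured inside the lookahead
# and therefore not consumed, so later selectors can still match.
pattern = re.compile(r"([^,{}\s]+(?:\s+[^,{}\s]+)*)(?=[^{}]*(\{[^{}]+\}))")

pairs = [(m.group(1), m.group(2)) for m in pattern.finditer(css)]

for selector, block in pairs:
    print(selector, "->", " ".join(block.split()))
```

On the sample this gives three matches, one per selector, each with the identical block in group 2. Note that group 2 includes the surrounding braces.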

This won't handle @-rules or comments, but it works fine on the sample data you provided. I even checked it out on some real-world stylesheets and it did remarkably well.

Alan Moore
Thanks. Similar to chaos' answer, capturing inside the lookahead was the solution. I gave you the accepted answer because your regex actually works on all the sample data that I threw at it (and GREATLY simplified the way I was doing it). I'm stripping comments out before processing anyway, so it seems to be all good now.
Rich