Hi there,

Longtime user/browser, first-time question-asker.

I'm writing a Perl script that will go through a number of HTML files, search them line-by-line for instances of "color:" or "background-color:" (the CSS tags) and print the entire line when it comes across one of these instances. This is fairly straightforward.

Now I'll admit I'm still a beginning programmer, so this next part may be extremely obvious, but that's why I came here :).

What I want it to do is when it finds an instance of "color:" or "background-color:" I want it to trace back and find the name of the element, and print that as well. For example:

If my document contained the following CSS:

.css_class {
    font-size: 18px;
    font-weight: bold;
    color: #FFEFA1;
        font-family: Arial, Helvetica, sans-serif;
}

I would want the script to output something like:

css_class,#FFEFA1

Ideally it would output this as a text file.

I would greatly appreciate any advice that could be given to me regarding this!

Here is my script in full thus far:

$color = "color:";


open (FILE, "index.html");  
@document = <FILE>;
close (FILE);  

foreach $line (@document){  
    if($line =~ /$color/){  
     print $line;  
    }  
}
A: 

I have not tested the code below, but something like this should work:

if ($line =~ m/\.(.*?) \{(.*?)color:(.*?);(.*)/) {
 print "$1,$3\n";
}

You should invest some time learning regular expressions for Perl.

Alec Smart
This is a really bad regex. For one, use \s instead of spaces. You're not using any regex modifiers, like /i and /m which you will most likely need here. Finally, what happens if there's no color property?
Artem Russakovskii
+5  A: 

Since you asked for advice (and this isn't a coding service) I'll offer just that.

Always use strictures and warnings:

use strict;
use warnings;

Always check the return value of open calls:

open(FILE, 'filename') or die "Can't read file 'filename' [$!]\n";

Use the three-arg form of open and lexical filehandles instead of globs:

open(my $fh, '<', 'filename') or die "Can't read file 'filename' [$!]\n";

Don't slurp when line-by-line processing will do:

while (my $line = <$fh>) {
    # do something with $line
}

Use backreferences to retrieve data from regex matches:

if ($line =~ /color *: *(#[0-9a-fA-F]{6})/) {
    # color value is in $1
}

Save the class name in a temporary variable so that you have it when you match a color:

if ($line =~ /^\.(\w+) *\{/) {
    $class = $1;  # declare "my $class;" before the loop
}
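
Putting these pieces together, a minimal sketch might look like the following. It is untested against real-world stylesheets, and the `extract_colors` name and the inline sample data are illustrative assumptions, not part of the advice above. It only handles blocks that start with a single class selector and colors written as hex values.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): takes a list of lines and
# returns "class,color" strings for simple one-class-per-block CSS.
sub extract_colors {
    my @lines = @_;
    my $class;      # most recently seen class name, if any
    my @found;

    for my $line (@lines) {
        # Remember the class name when a block opens, e.g. ".css_class {"
        if ($line =~ /^\s*\.([\w-]+)\s*\{/) {
            $class = $1;
        }
        # Matches both "color:" and "background-color:" followed by a hex value
        if (defined $class && $line =~ /color\s*:\s*(#[0-9a-fA-F]{3,6})\b/) {
            push @found, "$class,$1";
        }
        # Forget the class once its block closes
        $class = undef if $line =~ /\}/;
    }
    return @found;
}

# Sample input taken from the question
my @css = (
    ".css_class {\n",
    "    font-size: 18px;\n",
    "    color: #FFEFA1;\n",
    "}\n",
);
print "$_\n" for extract_colors(@css);   # prints css_class,#FFEFA1
```

To run it against a file, the lines read from a lexical filehandle (opened and checked as shown above) can be passed in place of `@css`.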
Michael Carman
I still think this is not the answer needed. Excellent general advice, though.
Leonardo Herrera
Yes, the advice was very helpful. Thanks. I have come a bit further in the past few hours with how I am approaching the solution. I'm not having problems with the regexes but rather with the capture of data. Since CSS elements are typically multi-line, I need to figure out how to create an array between the { and } with each linebreak as a delimiter for list items. The final (revised) format I need this data in is as follows (example): body:color:#000000
Ryan Max
Just remember that not all CSS elements are multi-line. Many simple cases declare multiple properties on one line. For example: * { margin: 0; padding: 0; }
Telemachus
+2  A: 

Well, this is not as simple as it seems.

CSS classes can be defined in many ways. For example,

    .classy {
         color: black;
    }

Good luck using a line-by-line approach for parsing that.

Actually, my first approach would be searching CPAN. This looks promising:

CSS - Object oriented access to Cascading Style Sheets (CSS)

Edit:

I installed HTML::TreeBuilder and CSS modules from CPAN and concocted the following aberration:

use strict;
use HTML::TreeBuilder;
use CSS;

foreach my $file_name (@ARGV) {
    my $tree = HTML::TreeBuilder->new; # empty tree
    $tree->parse_file($file_name);

    # In list context, find() returns every <style> element;
    # in scalar context it would return only the first one.
    my @styles = $tree->find('style');

    foreach my $style (@styles) {
        # This is an insane hack, not guaranteed
        # to work in the future.
        my $css = CSS->new;
        $css->read_string(join "\n", @{$style->{_content}});

        print $css->output;
    }
    $tree = $tree->delete;
}

This thing only prints all the CSS selectors from a list of HTML files, but nicely formatted, so you should be able to continue from here.

Leonardo Herrera
There's nothing difficult about parsing that line-by-line. You just need to save a copy of the class name when you find one. And while using CPAN is a good thing this is a good (simple) exercise for a Perl novice to cut his teeth on.
Michael Carman
Michael, if you can't see the difficulties of parsing CSS then I suspect you haven't thought it through. To parse CSS you have to implement a recursive descent parser.
Leonardo Herrera
I was addressing your example, not the general case.
Michael Carman
AFAICS, there is no need for a full recursive parser. The one thing you have to watch out for is CSS comments, but braces don't nest. Text outside of braces is either 1. a selector expression or 2. a comment. Text inside is either 1.a property expression or 2. a comment. You just have to scan for a couple of indicative expressions and transition states on those.
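
A rough sketch of that idea, assuming the input is plain CSS text (not a whole HTML page) with no nested blocks and no braces inside quoted strings. The `color_declarations` name and the sample stylesheet are made up for illustration. Instead of an explicit state variable, it strips comments first and then treats everything between a `{` and the next `}` as the "inside" state.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): returns
# "selector:property:value" strings for every color or background-color
# declaration found in a string of CSS.
sub color_declarations {
    my ($css) = @_;
    my @found;

    # Drop comments first so braces inside them cannot confuse the scan.
    $css =~ s{/\*.*?\*/}{}gs;

    # With no nesting, each rule is simply "selector { body }".
    while ($css =~ /([^{}]+)\{([^{}]*)\}/g) {
        my ($selector, $body) = ($1, $2);
        $selector =~ s/^\s+|\s+$//g;   # trim leading/trailing whitespace
        $selector =~ s/\s+/ /g;        # collapse line breaks and runs of spaces

        # Declarations are separated by semicolons, not by line breaks.
        for my $decl (split /;/, $body) {
            next unless $decl =~ /^\s*((?:background-)?color)\s*:\s*(.+?)\s*$/s;
            push @found, "$selector:$1:$2";
        }
    }
    return @found;
}

my $css = <<'CSS';
/* reset { not a rule } */
* { margin: 0; padding: 0; }
body {
    color: #000000;
    background-color: #FFFFFF;
}
CSS
print "$_\n" for color_declarations($css);
```

For the sample above this prints `body:color:#000000` followed by `body:background-color:#FFFFFF`. Because it splits on semicolons rather than newlines, single-line rules and multi-line rules are handled the same way.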
Axeman
unfortunately I am doing this for work and they are not willing to give me permissions to install any modules.
Ryan Max
Even if you think you can't install modules, you can always look inside them to see what they do. And, since they are just text, with a little work you can copy the modules right into your source. There are all sorts of ways around that.
brian d foy
You can easily install modules alongside your own code, and thus don't need permissions to install them in the usual perl dirs. Look up local::lib on CPAN.
castaway
+2  A: 

For yet another way to do it, you can ask perl to read from the file in sections other than lines, for example by using the "}" as a record separator.

my $color = "color:";

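open (my $fh, '<', "index.html") || die "Can't open file: $!";

{
    local $/ = "}";   # read up to each closing brace instead of each newline
    while (my $section = <$fh>) {
        # Capture the color value first, then find the selector before the "{"
        if (my ($value) = $section =~ /$color\s*([^;}]*)/) {
            my ($selector) = $section =~ /(.*?)\s*\{/;
            print "$selector,$value\n";
        }
    }
}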

Untested! Also, this of course assumes that your CSS neatly ends its sections with a } on a line of its own.

castaway
slick! Way to think outside the box.
Artem Russakovskii
+1  A: 

I'm not having problems with the regex's but rather with the capture of data. Since CSS elements are typically multi-line, I need to figure out how to create an array between the { and } with each linebreak as a delimiter for list items.

No, you don't.

For the problem as stated, the only lines of interest will be those containing either a class name or a color definition, and possibly also lines containing } to mark the end of a class. All other lines can be ignored, so there's no need to put them into an array.

Since class specifications cannot be nested[1], the last seen set of class names will always be the active set of classes. Therefore, you need only record the last seen set of class names and, when a color specification is encountered, print those class names.

There are still some potential difficulties handling cases in which a specification block is shared by multiple classes (.foo, .bar, .baz { ... }), which may or may not be spread across multiple lines, or if multiple attributes are defined on the same line, but dealing with those should follow fairly easily from what I've already laid out. Depending on your input data, you may also need to include a basic state engine to keep track of whether you're in comments or not.

[1] That is, although you can have semantically nested classes, such as .foo and .foo .bar, they have to be specified in the CSS file as

.foo {
  ...
}
.foo .bar {
  ...
}

and cannot be

.foo {
  ...
  .bar {
    ...
  }
}
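
For the shared-block case mentioned above (.foo, .bar, .baz { ... }), one possible sketch is to collect the selector text seen before the `{` and split it on commas once the block opens. The `split_selectors` helper and the sample values are hypothetical, and the sketch assumes no commas appear inside a selector itself (as they can in some functional pseudo-classes).

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): splits a selector group such
# as ".foo, .bar,\n.baz" into its individual selectors.
sub split_selectors {
    my ($group) = @_;
    my @selectors;
    for my $sel (split /,/, $group) {
        $sel =~ s/^\s+|\s+$//g;   # trim whitespace and line breaks
        $sel =~ s/\s+/ /g;        # ".foo   .bar" becomes ".foo .bar"
        push @selectors, $sel if length $sel;
    }
    return @selectors;
}

# Print one output line per selector that shares the block's color.
my $group = ".foo, .bar,\n.baz";
print "$_,#FFEFA1\n" for split_selectors($group);
```

The "last seen set of class names" then becomes the list returned by this helper, and each color specification is printed once for every entry in it.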
Dave Sherohman