ansaurus

Question

How do I extract and parse quoted strings in Perl?

Answer 1

+1 A:

#!/usr/bin/perl


$source_file = "tmp.txt";
$dest_xml_file = "my.xml";

#Check existance of root directory
open(SOURCE_FILE, "$source_file") || die "Fail to open file $source_file";
open(DEST_XML_FILE, ">$dest_xml_file") || die "Coult not open output file $dest_xml_file";

$x = 0;

print DEST_CS_FILE  "<!-- from tmp.txt-->\n";
while (<SOURCE_FILE>)   
{
    &ConstantParseAndPrint;

}

sub ConstantParseAndPrint
{
    if ($x == 0)
    {

#        if(/^\s*(constant)\s*(fixup|\/\*fixup\*\/|)\s*(private|)\s*(\w+)\s+=\s+<([a-zA-Z0-9]+)\s+([a-zA-Z0-9])+>\s*(\/\*\s*(.*?)\s*\*\/|)(\r|\n|\s)/)
        if(/^\s*(constant)\s*(fixup|\/\*fixup\*\/|)\s*(private|)\s*(\w+)\s+=\s+<([a-zA-Z0-9]+)\s+(["']?)([a-zA-Z0-9.:\\]+)\6>\s*(\/\*\s*(.*?)\s*\*\/|)(\r|\n|\s)/)

                {
                    $name1 = $1;
                    $name2 = $2;
                    $name3 = $3;
                    $name4 = $4;
                    $name5 = $5;
                    $name6 = $7;
                    $name7 = $8;
                    printf DEST_XML_FILE "\t\t$name1";
                    printf DEST_XML_FILE "\t\t$name2";
                    printf DEST_XML_FILE "\t\t$name3";
                    printf DEST_XML_FILE "\t\t$name4";
                    printf DEST_XML_FILE "\t\t$name5";
                    printf DEST_XML_FILE "\t\t$name6";
                    printf DEST_XML_FILE "\t\t$name7\n";
#                    $x = 1;
        }
    }
}

Use the following parse code:

if(/^\s*(constant)\s*(fixup|\/\*fixup\*\/|)\s*(private|)\s*(\w+)\s+=\s+<([a-zA-Z0-9]+)\s+(["']?)([a-zA-Z0-9.:\\]+)\6>\s*(\/\*\s*(.*?)\s*\*\/|)(\r|\n|\s)/)

I have added handling of both the single and double quotes. I use back-reference for quotes matching. Also I have updated the character class for path. i.e. it now includes the colon(:), dot(.), and backslash() along with alpha-numeric characters.

RahulJ 2009-12-22 11:11:22

Hello RahulJ, I tested your code. It still can't work. btw, Why do you add a "\6" in you code? Thank you.

Nano HE 2009-12-22 11:24:44

Oh god the leaning toothpicks. Use a different delimiter and the /x flag!

Schwern 2009-12-22 11:47:01

Hi Nano HE, I have added the code of your file. Use the file as it is.The \6 is used as a back-reference, so as to capture the quotes (whether its double or single quote) to be matched at the end of the url. The qoutes around the URL should be same.

RahulJ 2009-12-23 10:37:09

Hi RahuIj. I am appreciated for you for your fully support. btw. i still find a issue in your code above. The code can't get the comment content for the last line only. Such as the code can't parse and get the value of <q>/* U2 Format */</q> for my tmp.txt file. (I updated my original post tmp.txt file from two lines to three lines). Luckly all the other comments could be parsed successfully. THANK YOU A LOT! –

Nano HE 2009-12-23 11:49:28

Hi Nano, I have run my code with your file tmp.txt as input and the output of my.xml file is: constant fixup private AlarmFileName A C:\\TMP\\ALARM.LOG /* A Format */ constant fixup ConfigAlarms U1 0 /* U1 Format */ constant fixup ConfigEvents U2 0 /* U2 Format */It seems ok to me and the last line also contains the comment. May be I am missing something in your problem. Could you please clarify again what's the problem.

RahulJ 2009-12-24 07:10:12

Hi RahulJ, Your solution is fine. I found the difference. If i end my tmp.txt file last line with my flash cursor. the last line comment can't be parsed out successfully. If i press several ENTER key. the flash cursor will go down. Then It's no problem at this condition. THANKS.

Nano HE 2009-12-28 12:24:29

Answer 2

+1 A:

I have intentionally removed match captures (you can add them if you want):

m{^\s*constant\s+fixup\s+(?:private\s+)?\w+\s*=\s*<[^>]+>(?:\s*/\*(?:\s*\w*)+\*/)?$};

Alan Haggai Alavi 2009-12-22 11:48:52

Answer 3

+10 A:

I'm not going to write your regex for you or give you something to cut and paste into your code. At the rate your regex is going its going to break on the next special case anyway. What I will give you is a better approach.

Split each line into right and left hand side of the assignment.

my($lhs, $rhs) = split m{\s* = \s*}x, $line, 2;

Now its much easier to work with them individually. You can extract information from the left hand side by simply splitting it on whitespace to get all the flags (constant, fixup, etc...) and the last word will be the name being assigned to.

my @flags = split /\s+/, $lhs;
my $name  = pop @flags;

Then you can filter your lines by their @flags, if desired.

And the value, which is presumably inside brackets, can be gotten easily. Using a non-greedy regex ensures it correctly handles something like foo = <bar> /* comment <stuff> */.

my($value) = $rhs =~ /<(.*?)>/;

As you can see from this approach, it avoids having to guess what special keywords (constant, fixup, private) might appear in the file.

I have no idea what else might be in this file, you didn't say.

Schwern 2009-12-22 11:55:27

Thank you for you advice. Acturally it is a very big size file. There are some other keywords in the file. I plan to do practice with your method.

Nano HE 2009-12-22 12:22:58

Answer 4

+3 A:

You've got some major design flaws in your code. I haven't addressed your problem, but I have cleaned up your code.

Most importantly, do not use global variables. In a relatively short chunk of code you are using 3 globals. This is BEGGING for mystery bugs that are impossible to track down. This becomes an even bigger issue as your project grows larger over time.

Look into using Perl::Critic. It will help you improve your code.

Here is an annotated, sanitized version of your code:

# Always use strict and warnings.
# It prevents bugs.
use strict;
use warnings;

my $source_file   = "tmp.txt";
my $dest_xml_file = "my.xml";

# You aren't checking the existence of anyting here:
#Check existance of root directory 
# Is this a TODO item?

# Use 3 argument open with a lexical filehandle.
# Adding $! to your error messages makes them more useful.
open my $source_fh, '<', $source_file
    or die "Fail to open file $source_file - $!";

open( my $dest_fh, '>', $dest_xml_file 
    or die "Coult not open output file $dest_xml_file - $!";

my $x = 0;  # What the heck does this do?  Give it a meaningful name or
            # delete it.

print $dest_fh  "<!-- from tmp.txt-->\n";
while (my $line = <$source_fh>)   
{

    # Don't use global variables.
    # Explicitly pass all data your sub needs.
    # Any values that need to be applied to external 
    # data should be applied by the calling function,
    # from data that is returned.

    $x = ConstantParseAndPrint( $line, $x, $dest_fh );

}

sub ConstantParseAndPrint {
    my $line          = shift;
    my $mystery_value = shift;
    my $fh            = shift;

    if($mystery_value == 0) {

        # qr{} is a handy way to build a regex.
        # using {} instead of // to mark the boundaries helps
        # cut down on the escaping required when your pattern
        # contains the '/' character.

        # Use the x regex modifier to allow whitespace and 
        # comments in your regex.
        # This very is important when you can't avoid using a big, complex regex.

        # But really don't do it this way at all.
        # Do what Schwern says.
        my $line_match = qr{
            ^                      \s*  # Skip leading spaces
            (constant)             \s*  # look for the constant keyword
            (fixup|/\*fixup\*/|)   \s*  # look for the fixup keyword
            (private|)             \s*  # look for the prive keyword
            (\w+)                  \s+  # Get parameter name
            =                      \s+  
            <                           # get bracketed values
            ([a-zA-Z0-9]+)         \s+  # First value 
            ([a-zA-Z0-9])+              # Second value
            >                      \s*
            (/\*\s*(.*?)\s*\*/|)        # Find any trailing comment
            (\r|\n|\s)                  # Trailing whitespace
        }x;


        if( $line =~ /$line_match/ ) {

            # Any time you find yourself making variables
            # with names like $foo1, $foo2, etc, use an array.

            my @names = ( $1, $2, $3, $4, $5, $6, $7 );

            # printf is for printing formatted data.  
            # If you aren't using any format codes, use print.

            # Using an array makes it easy to print all the tokens.
            print $fh "\t\t$_" for @names;

            $mystery_value = 1;

        }
    }

    return $mystery_value;
}

As to your parsing question, follow Schwern's advice. Big, complex regexes are a sign that you need to simplify. Break big problems into manageable tasks.

daotoad 2009-12-22 20:44:27

"Break big problems into manageable tasks" - thanks for your guide.

Nano HE 2009-12-23 11:50:18

Answer 5

+1 A:

unify first!

$yourstring =~ s,\\,/,g;  # transform '\' into '/'
$yourstring =~ s,/+,/,g;  # transform multiple '/' into one '/'

2009-12-22 20:54:47

Answer 6

+2 A:

As was mentioned, you need some structure in your regex. In refatoring your code, I made a couple assumptions

You don't want to just print it out in a tabbed delimited format
The only reason for the $x variable is so that you only print one line. (although, a last at the end of the loop would have worked just fine.).

Having assumed these things, I decided that, in addressing your question, I would:

Show you how to make a good modifiable regex.
Code very simple "semantic actions" which store the data and let you use it as you please.

In addition is should be noted that I changed input to a __DATA__ section and output is restricted to STDERR--through the use of Smart::Comment comments, that hep me inspect my structures.

First the code preamble.

use strict;   # always in development!
use warnings; # always in development!
use English qw<$LIST_SEPARATOR>; # It's just helpful.
#use re 'debug';
#use Smart::Comments

Note the commented-out use re.... If you really want to see the way a regular expression gets parsed, it will put out a lot of information that you probably don't want to see (but can make your way through--with a little knowledge about regex parsing, nonetheless.) It's commented out because it is just not newbie friendly, and will monopolize your output. (For more about that see re.)

Also commented out is the use Smart::Comments line. I recommend it, but you can get by using Data::Dumper and print Dumper( \%hash ) lines. (See Smart::Comments.)

Specifying the Expression

But on to the regex. I used an exploded form of regex so that the parts of the whole are explained (see perlre). We want a single alphanumeric character OR a quoted string (with allowed escapes).

We also used a list of modifier names, so that the "language" can progress.

The next regex we make in a "do block" or as I like to call it a "localization block", so that I can localize $LIST_SEPARATOR (aka $") to be the regex alternation character. ('|'). Thus when I include the list to be interpolated, it is interpolated as an alternation.

I'll give you time to look at the second regex before talking about it.

# Modifiable list of modifiers
my @mod_names = qw<constant fixup private>;
# Break out the more complex chunks into separate expressions
my $arg2_regex 
    = qr{ \p{IsAlnum}             # accept a single alphanumeric character
        |                         # OR 
          "                       # Starts with a double quote
          (?>                     # -> We just want to group, not capture
                                  # the '?> controls back tracing
              [^\\"\P{IsPrint}]+  # any print character as long as it is not
                                  # a backslash or a double quote
          |   \\"                 # but we will accept a backslash followed by
                                  # a double quote
          |   (\\\\)+             # OR any amount of doubled backslashes
          )*                      # any number of these
          "
        }msx;

my $line_RE 
    = do { local $LIST_SEPARATOR = '|';
           qr{ \A                # the beginning
               \s*               # however much whitespace you need
               # A sequence of modifier names followed by space
               ((?: (?: @mod_names ) \s+ )*)
               ( \p{IsAlnum}+ )  # at least one alphanumeric character
               \s*               # any amount of whitespace
               =                 # an equals sign
               \s*               # any amount of whitespace
               <                 # open angle bracket
                 (\p{IsAlnum}+)  # Alphanumeric identifier
                 \s+             # required whitespace
                 ( $arg2_regex ) # previously specified arg #2 expression
                 [^>]*?
               >                 # close angle bracket
             }msx
             ;   
          };

The regex just says that we want any number of recognized "modifiers" separated by whitespace followed by an alphanumeric idenfier (I'm not sure why you don't want underscores; I don't include them, regardless.)

That is followed by any amount of whitespace and an equals sign. Since the sets of alphanumeric characters, whitespace, and the equals sign are all disjoint, there is no reason to require whitespace. On the other side of the equals sign, the value is delimited by angle brackets, so I don't see any reason to require whitespace on that side either. Before the equals all you've allowed is alphanumerics and whitespace and on the other side, it all has to be in angle brackets. Required whitespace gives you nothing, while not requiring it is more fault-tolerant. Ignore all that and change the *s to + if you are expecting a machine output.

On the other side of the equals sign, we require an angle bracket pair. The pair consists of an alphanumeric argument, with the second argument being EITHER a single alphanumeric character (based on your spec) OR a string which can contain escaped escapes or quotes and even the end angle bracket--as long as the string doesn't end.

Storing the Data

Once the specification has been made, here's just one of the things you can do with it. Because I don't know what you wanted to do with this besides print it out--which I'm going to assume is not the whole purpose of the script.

### $line_RE
my %fixup_map;
while ( my $line = <DATA> ) { 
    ### $line
    my ( $mod_text, $identifier, $first_arg, $second_arg ) 
        = ( $line =~ /$line_RE/ )
        ;
    die 'Did not parse!' unless $identifier;
    $fixup_map{$identifier}
        = { modifiers_for => { map { $_ => 1 } split /\s+/, $mod_text }
          , first_arg     => $first_arg
          , second_arg    => $second_arg
          };

    ### $fixup_map{$identifier} : $fixup_map{$identifier}
}
__DATA__
constant fixup ConfigAlarms  = <U1 0>
constant fixup ConfigAlarms2 = <U1 2>
constant fixup private AlarmFileName = <A "C:\\TMP\\ALARM.LOG">

At the end you can see the DATA section, when you're at the beginning stage as you seem to be here, it's most convenient to dispense with IO logic and use the builtin handle DATA as I do here.

I collect the modifiers in a hash, so that my semantic actions could be

#...
my $data = $fixup_map{$id};
#...
if ( $data->{modifiers_for}{public} ) {
    #...
}

Soap Box

The main problem however is that you don't seem to have a plan. For the second "argument" in the angle brakets, you have a regex that specifies only a single alphanumeric character, but want to expand it to allow escaped strings. I have to expect that you are implementing a small subset and gradually want expand it do do other things. If you neglect a good design from the beginning, it's only going to become more and more of a headache to implement the full-featured "parser".

You may want to implement multi-line values at some point. If you don't understand how to get from a single alphanumeric to a quote-delimited argument, the line-by-line method and the adjustments to the regex dwarf that complexity gap.

So I advise you to use the code here only as a guideline for expanding complexity. I'm answering a question and indicating a direction, not designing or coding a project, so my regex code isn't as expandable as it probably should be.

If the parsing job was complex enough, I would specify a minimal lookahead grammar for Parse::RecDescent, and stick to coding the semantic actions. That's another recommendation.

Axeman 2009-12-23 00:02:37

I've got a bookmark, and I'll read it when I get a chance. Thanks a lot.

Nano HE 2009-12-23 11:57:26

ansaurus

tags:

views:

answers:

How do I extract and parse quoted strings in Perl?

Specifying the Expression

Storing the Data

Soap Box

related questions