ansaurus

Question

What is the correct syntax to use perl array of hashes with regex? example gets hyperlinks from page

Answer 1

+1 A:

I think what you want is:

#!/usr/bin/perl

use strict;
use warnings;

my @array_of_links;

my $field = <<EOS;
<a href="foo.html">foo</a>
<a href="bar.html">bar</a>
<a href="baz.html">baz</a>
EOS

#/ this comment is to unconfuse the SO syntax highlighter. 

while ($field =~ m{<a.*?href="(.*?)".*?>(.*?)</a>}g) {
    push @array_of_links, { url => $1, text => $2 };
}

for my $link (@array_of_links) {
    print qq("$link->{text}" goes to -> "$link->{url}"\n);
}

The /o regex modifier does nothing if no variables are interpolated into the regex (and it probably shouldn't be used even when they are, because of its surprising behavior). The /m regex modifier does nothing because you don't have the ^ or $ anchors in your regex.
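To illustrate the point about /m, here is a minimal sketch (the sample string and variable names are made up for this example) showing that the modifier only changes where ^ and $ are allowed to match:

```perl
use strict;
use warnings;

# Hypothetical sample text: three lines held in a single string.
my $text = "alpha\nbeta\ngamma\n";

# Without /m, ^ matches only at the very start of the string.
my @without_m = $text =~ /^(\w+)/g;

# With /m, ^ also matches just after each embedded newline.
my @with_m = $text =~ /^(\w+)/mg;

# With no ^ or $ in the pattern, /m makes no difference.
my @plain   = $text =~ /(\w+)/g;
my @plain_m = $text =~ /(\w+)/mg;

print "without /m: @without_m\n";
print "with /m:    @with_m\n";
print "no anchors: @plain | @plain_m\n";
```

Only the anchored pattern behaves differently under /m; the unanchored one returns the same list either way.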

You can't create an array of hashes that way. You may want to reread perldoc perldsc.
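As a quick reminder of the shape perldsc describes, here is a small sketch of an array of hashes (the variable name and the sample data are invented for illustration): each element is a reference to an anonymous hash built with curly braces.

```perl
use strict;
use warnings;

my @links;

# Each element is a reference to an anonymous hash, built with { ... }.
push @links, { url => 'foo.html', text => 'foo' };
push @links, { url => 'bar.html', text => 'bar' };

# The arrow between adjacent subscripts is optional,
# so these two forms reach into the structure the same way.
print $links[0]->{url}, "\n";
print $links[1]{text}, "\n";

# In scalar context the array gives the number of hashes stored.
print scalar(@links), "\n";
```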

C-style for loops are rarely needed in Perl 5. The iterating for loop is much better. If you need to know the index into an array, you should use the range operator:

for my $i (0 .. $#array_of_links) {
    print qq($i. "$array_of_links[$i]{text}" goes to -> "$array_of_links[$i]{url}"\n);
}

Perl 5 allows you to choose your own delimiters for strings and regexes if you use their general forms (e.g. m// for regexes and qq// for double quotes). You can use this to avoid having to use ugly escapes that make your strings and regexes hard to read.
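A side-by-side sketch of the difference, using a made-up one-line input and hypothetical variable names; both halves do the same work, and only the delimiters change:

```perl
use strict;
use warnings;

my $html = '<a href="foo.html">foo</a>';

# Default delimiters: the / inside the regex and the " inside
# the string both have to be escaped with a backslash.
my ($url_escaped) = $html =~ /<a href="(.*?)">.*?<\/a>/;
my $msg_escaped   = "link goes to \"$url_escaped\"\n";

# General forms with other delimiters: no escapes needed.
my ($url_braces) = $html =~ m{<a href="(.*?)">.*?</a>};
my $msg_braces   = qq(link goes to "$url_braces"\n);

print $msg_escaped, $msg_braces;
```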

However, it looks like you are trying to use a regex to parse HTML. This is a path that is filled with pain. You should really be looking into how to use a parser instead.
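A small sketch of the kind of trouble meant here, running the same pattern as the while loop above over made-up input. The commented-out link and the single-quoted attribute are assumptions chosen to show two failure modes; real pages have many more.

```perl
use strict;
use warnings;

# Hypothetical input: one ordinary link, one link inside an HTML
# comment, and one link whose attribute uses single quotes.
my $html = <<'EOS';
<a href="foo.html">foo</a>
<!-- <a href="old.html">old</a> -->
<a href='bar.html'>bar</a>
EOS

my @urls;
while ($html =~ m{<a.*?href="(.*?)".*?>(.*?)</a>}g) {
    push @urls, $1;
}

# The regex reports the commented-out link (old.html) and
# misses the single-quoted one (bar.html) entirely.
print "$_\n" for @urls;
```

Each case can be patched with another regex, but the patches accumulate quickly, which is the argument for a real parser.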

Chas. Owens 2010-09-03 18:00:23

After a little reading I see how this fails when some $@#$! decides to put a comment right in the middle of an anchor tag, and that an actual parser is a much more robust way of getting the data. NOTE: I still enjoy the regex method, because you can see the computation in front of you. Maybe I should look at the insides of some of these parsers. What is the difference between what we are doing here and parsing, though? Could I not add a regex to strip out the XML comments before looking for the links, or is a parser just a combination of regexes?

GlassGhost 2010-09-03 19:27:47

Answer 2

+2 A:

I'll begin with the standard disclaimer that parsing HTML with regular expressions is a bad idea.

Evaluate the regular-expression match in scalar context:

In scalar context, each execution of m//g finds the next match, returning true if it matches, and false if there is no further match.

Then for each match, push a new hashref onto @array_of_links:

my @array_of_links;
push @array_of_links => { url => $1, text => $2 }
  while $field =~ /<a.*?href="(.*?)".*?>(.*?)<\/a>/mgo;
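To see the scalar-context behavior of m//g in isolation, here is a minimal sketch on a made-up string (the variable names are hypothetical). It uses pos() to show where each successive search resumes, and contrasts that with list context, which returns every capture at once:

```perl
use strict;
use warnings;

my $text = "a1 b2 c3";

# Scalar context (a while condition): one match per evaluation.
# pos() reports the offset where the next search will start.
my @seen;
while ($text =~ /([a-z])(\d)/g) {
    push @seen, "$1$2 ends at " . pos($text);
}

# List context: all captures from all matches in one go.
my @all = $text =~ /([a-z])(\d)/g;

print "$_\n" for @seen;
print "@all\n";
```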

Your code for dumping the contents isn't terribly idiomatic. A Perl-style for loop is clearer and less cluttered syntactically:

for (@array_of_links) {
  print qq["$_->{text}" goes to -> "$_->{url}"\n];
}

Greg Bacon 2010-09-03 18:10:54
