I'm trying to parse a single string and pull multiple chunks of data out of it using the same regex. The string is a single, static HTML document. (For an undisclosed reason, I can't use an HTML parser to do the job.) I have an expression that looks like:

$string =~ /\<img\ssrc\="(.*)"/;

and I want to get the value of $1. However, the string contains many img tags like this, so I need something like an array returned (@1?). Is this possible?

A: 

Use the /g modifier and list context on the left, as in

@result = $string =~ /\<img\ssrc\="(.*)"/g;
Jim Garrison
But I don't have an array of strings, just one. I'm trying to get individual sources out of the multiple img tags in the single string, returned as an array. I tried this but it didn't return anything.
Sho Minamimoto
Robert's answer gives the correct syntax for this approach.
leonbloy
What do you think that binding operator is doing? :)
brian d foy
I omitted part of the answer by accident. It has been corrected.
Jim Garrison
+1  A: 

You just need the global modifier /g at the end of the match. Then loop until there are no matches remaining:

my @matches;
while ($string =~ /\<img\ssrc\="(.*)"/g) {
        push(@matches, $1);
}
dalton
+2  A: 

As in Jim's answer, use the /g modifier (in list context or in a loop).

But beware of greediness: you don't want the .* to match more than necessary. (Also, don't escape < and =; they are not special.)

while($string =~ /<img\s+src="(.*?)"/g ) {
  ...
} 
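
To make the greediness issue concrete, here is a minimal sketch that runs both forms against the same input. The sample string and the variable names are made up for illustration; only the two patterns come from this thread.

```perl
use strict;
use warnings;

# Hypothetical sample input: two img tags on one line.
my $string = '<img src="a.png" alt="A"> text <img src="b.png" alt="B">';

# Greedy: (.*) runs on to the LAST double quote on the line,
# so the single capture swallows everything in between.
my @greedy = $string =~ /<img\s+src="(.*)"/g;

# Non-greedy: (.*?) stops at the first closing quote after src=".
my @lazy = $string =~ /<img\s+src="(.*?)"/g;

print "greedy: $_\n" for @greedy;
print "lazy:   $_\n" for @lazy;
```

On this input the greedy pattern should yield one oversized capture, while the non-greedy one yields the two src values separately.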
leonbloy
Awesome, yeah I was having a problem with the greediness, that ? fixed it. Say, would you happen to know the list of characters that need to be escaped in regex? I basically escape almost everything because I don't know better :P
Sho Minamimoto
In general you must escape metacharacters and quantifiers. In Perl you have: `Metacharacters: . $ ^ | () [] \ Quantifiers: * + ? {}` But there are some complications: in particular, inside a character class [] things change.
leonbloy
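If you are unsure which characters need escaping, one option is to let Perl do it: the built-in quotemeta function, or \Q...\E inside a pattern, backslash-escapes every non-word character in a literal string. A small sketch, with made-up sample text and variable names:

```perl
use strict;
use warnings;

# Hypothetical literal text containing several metacharacters.
my $literal = 'price (USD): $5.00+';

# quotemeta returns a copy with the non-word characters backslash-escaped.
my $escaped = quotemeta($literal);
print "$escaped\n";

# \Q...\E applies the same escaping to an interpolated variable.
my $text  = 'Total price (USD): $5.00+ tax';
my $found = ($text =~ /\Q$literal\E/) ? 1 : 0;
print "found: $found\n";

# < and = are not metacharacters, so they can be written unescaped.
my $plain = ('<img src="x.png">' =~ /<img\s+src="/) ? 1 : 0;
print "plain: $plain\n";
```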
...but the better way to fix that greediness problem is to use `"([^"]*)"`. In many regex engines, this will be more efficient, but, more importantly, it is a clearer statement of your intent: You want to match " followed by some number of *non-doublequote* characters, followed by another ", not two " characters separated by the shortest possible sequence of *any characters at all*.
Dave Sherohman
@Dave: Yes, those are the two common ways of avoiding greediness, and it's good to be aware of both and use the more appropriate one. But (though I agree that yours is a little more semantically correct), in this particular pattern (which ends at the quote) they are exactly equivalent (functionally, perhaps not speedwise), and mine was a little easier on the eyes.
leonbloy
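For reference, a sketch comparing the two forms from these comments. On single-line attribute values they capture the same text. One assumption-dependent difference: without the /s modifier, . does not match a newline, so the forms diverge if a quoted value happens to contain a line break. The sample input below is invented to show that case.

```perl
use strict;
use warnings;

# Hypothetical input; the second src value contains a line break.
my $string = qq{<img src="a.png"> <img src="b\n.png">};

# Non-greedy dot: cannot cross the newline, so the second tag is skipped.
my @lazy = $string =~ /<img\s+src="(.*?)"/g;

# Negated class: [^"] matches any character except a double quote,
# newlines included, so both values are captured.
my @negated = $string =~ /<img\s+src="([^"]*)"/g;

print scalar(@lazy),    " match(es) with .*?\n";
print scalar(@negated), " match(es) with [^\"]*\n";
```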
+1  A: 
@list = ($string =~ m/\<img\ssrc\="(.*)"/g);

The g modifier matches all occurrences in the string. In list context, the match returns all of the captures. See the m// operator in perlop.
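
A self-contained sketch of the list-context behaviour, using an invented sample string. Note that it swaps in the `[^"]*` capture suggested elsewhere in this thread instead of the greedy pattern above, so that each src value is captured separately.

```perl
use strict;
use warnings;

# Hypothetical sample input with two img tags.
my $string = '<img src="one.gif"><img src="two.gif">';

# List context: every capture from every match is returned at once.
my @list = ($string =~ m/<img\s+src="([^"]*)"/g);

# Counting idiom: the empty-list assignment forces list context,
# and the outer scalar assignment then counts the returned captures.
my $count = () = $string =~ m/<img\s+src="([^"]*)"/g;

print "$count sources: @list\n";
```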

Robert Wohlfarth