I'm new to Perl and regular expressions, and I am having a hard time extracting a string enclosed in double quotes. For example:

"Stackoverflow is

awesome"

Before I extract the string, I want to check whether the closing quote is at the end of the whole text stored in the variable:

if($wholeText =~ /\"$/)   #check the last character if " which is the end of the string
{
   $wholeText =~ s/\"(.*)\"/$1/;   #extract the string, removed the quotes
}

My code didn't work; it is not getting inside of the if condition.

+4  A: 

You need to do:

if($wholeText =~ /"$/)
{
    $wholeText =~ s/"(.*?)"/$1/s;
}

The . metacharacter doesn't match newlines unless you apply the /s modifier.

There's no need to escape the quotes like you're doing.
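
As a rough illustration of the /s point, here is a minimal sketch using the sample text from the question. The variable names ($text, $without_s, $with_s) are made up for this example and are not from the original code.

```perl
use strict;
use warnings;

# Sample text from the question: a quoted string spanning several lines.
my $text = "\"Stackoverflow is\n\nawesome\"";

# Without /s, "." stops at a newline, so the pattern never reaches
# the closing quote and the substitution does nothing.
(my $without_s = $text) =~ s/"(.*?)"/$1/;

# With /s, "." also matches newlines, so the quotes are stripped.
(my $with_s = $text) =~ s/"(.*?)"/$1/s;

print "without /s: [$without_s]\n";
print "with /s:    [$with_s]\n";
```

The first result still has its quotes; the second is the bare text with the embedded newlines preserved.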

chaos
+1  A: 

For multi-line strings, you need to include the 'm' modifier with the search pattern.

if ($wholeText =~ m/\"$/m) # First m for match operator; second multi-line modifier
{
     $wholeText =~ s/\"(.*?)\"/$1/s;   #extract the string, removed the quotes
}

You will also need to consider whether you allow double quotes inside the string and, if so, which convention to use. The primary ones are backslash escaping (backslash double quote for a quote, and backslash backslash for a literal backslash) or doubling the quote (double quote double quote) inside the string. Either one slightly complicates your regex.
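
To give an idea of how the two conventions change the pattern, here is a hedged sketch. The names $backslash_re and $doubled_re and the sample strings are invented for illustration; real input may need more care (for example, an unterminated quote).

```perl
use strict;
use warnings;

# Convention 1: backslash escapes, e.g. "say \"hi\"".
# Inside the quotes, accept any non-quote, non-backslash character,
# or a backslash followed by any character.
my $backslash_re = qr/"((?:[^"\\]|\\.)*)"/s;

# Convention 2: doubled quotes, e.g. "say ""hi""".
# Inside the quotes, accept any non-quote character or a pair of quotes.
my $doubled_re = qr/"((?:[^"]|"")*)"/s;

my $a_text = 'x = "say \"hi\"" end';
my $b_text = 'x = "say ""hi""" end';

my ($a_inner) = $a_text =~ $backslash_re;
my ($b_inner) = $b_text =~ $doubled_re;

print "$a_inner\n";
print "$b_inner\n";
```

Both captures still contain the escape sequences themselves; turning them back into plain quotes would be a separate substitution step.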

The answer by @chaos uses the 's' modifier rather than 'm'. There's a small difference between the two:

  • m

Treat string as multiple lines. That is, change "^" and "$" from matching the start or end of the string to matching the start or end of any line anywhere within the string.

  • s

Treat string as single line. That is, change "." to match any character whatsoever, even a newline, which normally it would not match.

Used together, as /ms, they let the "." match any character whatsoever, while still allowing "^" and "$" to match, respectively, just after and just before newlines within the string.
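
The difference between the two modifiers can be checked with a small sketch like the following. The sample string and variable names are assumptions made for this example.

```perl
use strict;
use warnings;

# A quote ends the first line, but not the whole string.
my $text = "first \"line\"\nsecond line";

# Without /m, "$" matches only at the end of the string
# (or just before a final newline), so this does not match.
my $plain = $text =~ /"$/ ? 1 : 0;

# With /m, "$" also matches just before any embedded newline.
my $multi = $text =~ /"$/m ? 1 : 0;

# /s affects only ".": it lets ".+" run across the newline.
my $dot_plain = $text =~ /first.+second/  ? 1 : 0;
my $dot_s     = $text =~ /first.+second/s ? 1 : 0;

print "plain=$plain multi=$multi dot_plain=$dot_plain dot_s=$dot_s\n";
```

This prints plain=0 multi=1 dot_plain=0 dot_s=1.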

Jonathan Leffler
@Brian: what does the question mark in the second expression do? AFAICS, it means 0 or 1 of the previous match of 0 or more characters...
Jonathan Leffler
+1  A: 

The above poster who recommended using the "m" flag in the regular expression is correct; however, the regex provided won't quite work. When you say:

$wholeText =~ s/\"(.*)\"/$1/m;   #extract the string, removed the quotes

...the regular expression is too "greedy", which means the (.*) part will gobble up too much of the text. If you have a sample like this:

"The quick brown fox," he said, "jumped over the lazy dog."

...then the above regex will capture everything from "The" through "dog.", which is probably not what you intend. There are two ways to make the regex less greedy. Which one is better has everything to do with how you choose to handle extra " marks inside your string.

One:

$wholeText =~ s/\"([^"]*)\"/$1/m;

Two:

$wholeText =~ s/\"(.*?)\"/$1/m;

In One, the regex says "start with quote, then find everything that is not a quote and remember it, until you see another quote." In Two, the regex says "Start with quote, then find everything until you find another quote." The extra ? inside the ( ) tells the regex processor to not be greedy. Without considering quote escaping within the string, both regular expressions should behave the same on a single line. Across newlines they differ: [^"] matches a newline, while . does not unless you add the /s modifier.
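
Here is a small sketch comparing the greedy pattern with One and Two on the sample sentence above. The variable names are invented for illustration; the last two matches use a made-up two-line string to check the newline case.

```perl
use strict;
use warnings;

my $text = '"The quick brown fox," he said, "jumped over the lazy dog."';

my ($greedy)  = $text =~ /"(.*)"/;      # runs on to the last quote in the text
my ($negated) = $text =~ /"([^"]*)"/;   # One: stops at the next quote
my ($lazy)    = $text =~ /"(.*?)"/;     # Two: stops at the next quote

print "greedy:  $greedy\n";
print "negated: $negated\n";
print "lazy:    $lazy\n";

# A quoted string containing a newline, with no /s modifier.
my $multi = "\"two\nlines\"";
my ($neg_multi)  = $multi =~ /"([^"]*)"/;   # matches: [^"] accepts "\n"
my ($lazy_multi) = $multi =~ /"(.*?)"/;     # no match, so this stays undef

print "One matched across the newline\n" if defined $neg_multi;
print "Two did not match without /s\n"   if !defined $lazy_multi;
```

On the sample sentence, the greedy capture runs from The through dog., while One and Two both capture only The quick brown fox,.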

By the way, this is a classic problem when parsing a CSV ("Comma Separated Values") file, so looking up some references on that may help you out.

Aaron Brown
I don't think the /m does what you think it does. If you don't have the anchors ^ or $ in your regex, the /m does nothing.
brian d foy
+2  A: 

If you want to anchor a match to the very end of the string (not line, entire string), use the \z anchor:

 if( $wholeText =~ /"\z/ ) { ... }

You don't need a guard condition for this. Just use the right regex in the substitution. If it doesn't match the regex, nothing happens:

 $wholeText =~ s/"(.*?)"\z/$1/s;
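
To see how \z differs from $, and that the substitution is safe without a guard, here is a minimal sketch. The sample strings and variable names are assumptions made for this example.

```perl
use strict;
use warnings;

# A closing quote followed by a trailing newline.
my $with_newline = "\"quoted\"\n";

# "$" matches at the end of the string or just before a final newline.
my $dollar = $with_newline =~ /"$/ ? 1 : 0;

# "\z" matches only at the very end of the string.
my $slash_z = $with_newline =~ /"\z/ ? 1 : 0;

# No guard needed: when the pattern does not match, the string is left alone.
(my $unchanged = 'say "hi" now') =~ s/"(.*?)"\z/$1/s;
(my $stripped  = '"hi"')         =~ s/"(.*?)"\z/$1/s;

print "dollar=$dollar slash_z=$slash_z\n";
print "unchanged=[$unchanged] stripped=[$stripped]\n";
```

If the text may end with a newline after the closing quote, \z will not match, so it may be worth chomping the input first or using \Z instead.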

I think you really have a different question though. Why are you trying to anchor it to the end of the string? What problems are you trying to avoid?

brian d foy