Hi everyone!

I have the following line:

"14:48 say;0ed673079715c343281355c2a1fde843;2;laka;hello ;)"

I parse this by using a simple regexp:

if($line =~ /(\d+:\d+)\ssay;(.*);(.*);(.*);(.*)/) {
    my($ts, $hash, $pid, $handle, $quote) = ($1, $2, $3, $4, $5);
}

But the ; at the end messes things up and I don't know why. Shouldn't the greedy operator handle "everything"?

+2  A: 

Try making the first three (.*) groups non-greedy: (.*?)

Greg
+6  A: 
(\d+:\d+)\ssay;([^;]*);([^;]*);([^;]*);(.*)

should work better

VonC
I think you have an extra ([^;]*); the last part is a comment containing a smiley: "hello ;)"
Ady
Ady: Right: the last part can be as simple as (.*) to get the rest of the line. Fixed
VonC
+13  A: 

The greedy operator tries to grab as much stuff as it can and still match the string. What's happening is the first one (after "say") grabs "0ed673079715c343281355c2a1fde843;2", the second one takes "laka", the third finds "hello " and the fourth matches the parenthesis.

What you need to do is make all but the last one non-greedy, so they grab as little as possible and still match the string:

(\d+:\d+)\ssay;(.*?);(.*?);(.*?);(.*)
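
To see the difference side by side, here is a small sketch that runs both patterns against the line from the question. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $line = "14:48 say;0ed673079715c343281355c2a1fde843;2;laka;hello ;)";

# Greedy: each (.*) takes as much as it can, so the first group swallows a ';'.
my @greedy = $line =~ /(\d+:\d+)\ssay;(.*);(.*);(.*);(.*)/;

# Lazy: each (.*?) stops at the first ';' that works; the final (.*) takes the rest.
my @lazy = $line =~ /(\d+:\d+)\ssay;(.*?);(.*?);(.*?);(.*)/;

print join(",", map { "[$_]" } @greedy), "\n";
print join(",", map { "[$_]" } @lazy), "\n";
```

The first line printed should show the hash and the "2" fused into one field and a lone ")" at the end; the second should show the five fields as intended, with "hello ;)" intact.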
Barry Brown
That's great! Can you quickly tell me the difference between .*? and .* ? Thanks! :)
Lasse A Karlsen
The difference is that .*? stops at the first instance of whatever follows that still lets the rest of the pattern match, whereas .* stops at the last such instance.
eyelidlessness
Ah, great folks! Appreciate it! :-)
Lasse A Karlsen
The ? modifies the * operator to make it non-greedy. You can also use ? with + to make it non-greedy, as well.
Barry Brown
Very good general-case answer, but, for this specific question, I would favor [^;]* over .*? because the boundary which terminates the match is a single character. There are cases where .*? is what you need, but I find it best to avoid .* entirely whenever possible.
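
A quick sketch of one case where the two behave differently. The input line here is made up (an extra "oops" field before the pid), and the pattern is shortened to the first two fields to keep it small:

```perl
use strict;
use warnings;

# Hypothetical malformed line: an extra field ("oops") before the numeric pid.
my $bad = "14:48 say;0ed673;oops;2;laka;hello";

# Lazy dot: expands past the first ';' so that (\d+) can still match.
my ($lazy_hash) = $bad =~ /say;(.*?);(\d+);/;

# Negated class: can never cross a ';', so the malformed line is rejected.
my ($strict_hash) = $bad =~ /say;([^;]*);(\d+);/;

print "lazy:   ", (defined($lazy_hash)   ? $lazy_hash   : "(no match)"), "\n";
print "strict: ", (defined($strict_hash) ? $strict_hash : "(no match)"), "\n";
```

With .*? the "hash" comes back as "0ed673;oops", which is probably not what you want; with [^;]* the match simply fails.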
Dave Sherohman
+1  A: 

You could make * non-greedy by appending a question mark:

$line =~ /(\d+:\d+)\ssay;(.*?);(.*?);(.*?);(.*)/

or you can match everything except a semicolon in each part except the last:

$line =~ /(\d+:\d+)\ssay;([^;]*);([^;]*);([^;]*);(.*)/
Robert Gamble
+6  A: 

Although a regex can easily do this, I'm not sure it's the most straightforward approach. It's probably the shortest, but that doesn't necessarily make it the most maintainable.

Instead, I'd suggest something like this:

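my $x = "14:48 say;0ed673079715c343281355c2a1fde843;2;laka;hello ;)";

# Peel off the timestamp first, then split the rest on semicolons.
if (my ($ts, $rest) = $x =~ /(\d+:\d+)\s+(.*)/)
{
    # The limit of 5 keeps any semicolons inside the quote intact.
    my ($command, $hash, $pid, $handle, $quote) = split /;/, $rest, 5;
    print join(",", map { "[$_]" } $ts, $command, $hash, $pid, $handle, $quote), "\n";
}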

This results in:

[14:48],[say],[0ed673079715c343281355c2a1fde843],[2],[laka],[hello ;)]

I think this is just a bit more readable. Not only that, I think it's also easier to debug and maintain, because it is closer to how a human would do the same thing with pen and paper. Break the string down into chunks that you can then parse more easily, and have the computer do exactly what you would do. When it comes time to make modifications, I think this one will fare better. YMMV.
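
The third argument to split is what protects the smiley. A small sketch of the difference, using the part of the sample line after the timestamp (variable names are just for illustration):

```perl
use strict;
use warnings;

my $rest = "say;0ed673079715c343281355c2a1fde843;2;laka;hello ;)";

# No limit: every ';' is a separator, so the quote is cut in two.
my @unlimited = split /;/, $rest;

# Limit of 5: at most five fields, and the last one keeps its semicolons.
my @limited = split /;/, $rest, 5;

print scalar(@unlimited), " fields, last = [$unlimited[-1]]\n";
print scalar(@limited),   " fields, last = [$limited[-1]]\n";
```

Without the limit you get six fields ending in a bare ")"; with it you get five, and the last is "hello ;)".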

Tanktalus
+3  A: 

If the values in your semicolon-delimited list cannot include any semicolons themselves, you'll get the most efficient and straightforward regular expression simply by spelling that out. If certain values can only be, say, a string of hex characters, spell that out. Solutions using a lazy or greedy dot will always lead to a lot of useless backtracking when the regex does not match the subject string.

(\d+:\d+)\ssay;([a-f0-9]+);(\d+);(\w+);([^\r\n]+)
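
As a rough sketch of how a spelled-out pattern might be used: parse_say() below is a hypothetical helper name, not something from this thread, and the second input line is made up. I'm assuming the last field should run to the end of the line, since the quote in the question contains a semicolon.

```perl
use strict;
use warnings;

# parse_say() is a hypothetical helper, shown only to illustrate the idea.
sub parse_say {
    my ($line) = @_;
    # Each field is spelled out; the last one takes the rest of the line,
    # so a quote such as "hello ;)" keeps its semicolon.
    my ($ts, $hash, $pid, $handle, $quote) =
        $line =~ /^(\d+:\d+)\ssay;([a-f0-9]+);(\d+);(\w+);([^\r\n]*)/
        or return;
    return { ts => $ts, hash => $hash, pid => $pid, handle => $handle, quote => $quote };
}

my $ok  = parse_say("14:48 say;0ed673079715c343281355c2a1fde843;2;laka;hello ;)");
my $bad = parse_say("14:48 say;not-hex;2;laka;hello");

print "quote: [$ok->{quote}]\n" if $ok;
print "malformed line rejected\n" unless $bad;
```

The made-up second line fails as soon as the hash field turns out not to be hex, instead of matching with fields shifted into the wrong slots.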
Jan Goyvaerts
Jan, if you want something to be marked up as source code, each line has to start with four spaces. And welcome to SO.
Alan Moore