I'd like to be able to identify patterns of the form

28°44'30"N., 33°12'36"E.

Here's what I have so far:

use utf8;
qr{
    (?:
    \d{1,3} \s*  °   \s*
    \d{1,2} \s*  '   \s*
    \d{1,2} \s*  "   \s*
    [ENSW]  \s* \.?
            \s*  ,?  \s*
    ){2}
}x;

Needless to say, this doesn't match. Does it have anything to do with the extended characters (namely the degree symbol)? Or am I just screwing this up big time?

I'd also appreciate directions to CPAN, if you know of something there that will solve my problem. I've looked at Regexp::Common and Geo::Formatter, but neither does what I want. Any ideas?

Update

It turns out that I needed to take out use utf8 when reading the coordinates from a file. When I manually initialized a variable with a coordinate, it matched fine, but as soon as I read that same line from a file, it didn't match. Taking out use utf8 solved that. I guess I don't really understand what utf8 is doing.

+1  A: 

You forgot the x modifier on the qr operator.

daxim
Thanks, unfortunately, that was only a typo here. It still doesn't work :(
Pedro Silva
+1  A: 

Try dropping the use utf8 statement.

The degree symbol corresponds to character value 0xB0 in my current encoding (whatever that is, but it ain't UTF-8). 0xB0 is a "continuation byte" in UTF-8; it is expected to be the second, third, or fourth byte of a sequence that begins with something between 0xC2 and 0xF4. Using that string with utf8 will give you an error.
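To make the byte values above concrete, here is a minimal sketch that encodes the degree sign two ways and prints the resulting bytes in hex. It assumes the "current encoding" mentioned above is ISO-8859-1 (the answer does not say which it is), and the helper name hex_bytes is made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as upper-case hex digits
sub hex_bytes { return uc unpack 'H*', $_[0] }

# U+00B0 DEGREE SIGN as a one-character Perl string
my $degree = "\x{B0}";

# Same character, two encodings (ISO-8859-1 is an assumption here)
my $latin1 = hex_bytes(encode('ISO-8859-1', $degree));
my $utf8   = hex_bytes(encode('UTF-8',      $degree));

print "ISO-8859-1: $latin1\n";   # prints "ISO-8859-1: B0"
print "UTF-8:      $utf8\n";     # prints "UTF-8:      C2B0"
```

In UTF-8 the lone 0xB0 byte never appears on its own: it follows the lead byte 0xC2.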

mobrule
it doesn't, on my machine; in fact, it does not seem to make much of a difference. Would you know how I would go about including the damned `º` in the regex?
Pedro Silva
Turns out you were right.
Pedro Silva
Pedro: And that is why you are supposed to [decode your strings properly](http://p3rl.org/UNI) before you work on them with character oriented operations such as regex. By merely dropping the `utf8` pragma, you have swept the symptoms of the problem under the carpet - but it still exists to unexpectedly bite you in the future. I bet that in your program the test string is not a literal as in Kinopiko's answer and `Devel::Peek` would reveal that the simplified example is not functionally equivalent to your real code from which it is derived - please post a *complete* code example the next time.
daxim
You're right, of course. My test strings were read in from a file.
Pedro Silva
+1  A: 

The ?: at the beginning of the regex makes it non-capturing, which is probably why the matches cannot be extracted or seen. Dropping it from the regex may be the solution.
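As a rough sketch of what capturing could look like, the pattern below describes a single coordinate with four capturing groups and is applied repeatedly with /g. Capturing groups inside a repeated group such as (...){2} keep only the last repetition, so the sketch matches one coordinate at a time instead. The variable names are made up for this example, and use utf8 is assumed because the source contains a literal °.

```perl
use strict;
use warnings;
use utf8;   # the pattern and the test string contain a literal °

# One coordinate: degrees, minutes, seconds, hemisphere (four capturing groups)
my $coord = qr{
    (\d{1,3}) \s* ° \s*
    (\d{1,2}) \s* ' \s*
    (\d{1,2}) \s* " \s*
    ([ENSW])  \s* \.?
}x;

my $text = q{28°44'30"N., 33°12'36"E.};

# In list context with /g, every match contributes its four captured fields
my @fields = $text =~ /$coord/g;

print "@fields\n";   # prints "28 44 30 N 33 12 36 E"
```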

If all of the coordinates are fixed-format, unpack may be a better way of obtaining the desired values.

# assumes `use utf8` is in effect, so that ° counts as a single character
my @twoCoordinates = unpack 'A2xA2xA2xAx3A2xA2xA2xA', q{28°44'30"N., 33°12'36"E.};

print "@twoCoordinates";  # prints '28 44 30 N 33 12 36 E'

If not, then modify the regex:

my @twoCoordinates = q{28°44'30"N., 33°12'36"E.} =~ /\w+/g;
Zaid
Yeah, but I've been simplifying the regex, including removing the non-capturing parentheses, to no avail. Thanks for the unpack idea though; it sounds like it should work, although I can't be sure I won't see coordinates with 3-digit degrees, 1-digit minutes, etc.
Pedro Silva
The thing is, my priority is to actually identify strings of that nature. `unpack` would sure come in handy if I knew a particular string were a coordinate, but if I knew that, I wouldn't need `unpack`, because I'd already have identified it via a regex. :(
Pedro Silva
+5  A: 

This:

use strict;
use warnings;
use utf8;
my $re = qr{
    (?:
    \d{1,3} \s*  °   \s*
    \d{1,2} \s*  '   \s*
    \d{1,2} \s*  "   \s*
    [ENSW]  \s* \.?
            \s*  ,?  \s*
    ){2}
}x;
if (q{28°44'30"N., 33°12'36"E.} =~ $re) {
    print "match\n";
} else {
    print "no match\n";
}

works:

$ ./coord.pl 
match
Kinopiko
confirming that this works
singingfish
just to confirm that this works.
singingfish
And strangely, it matches even without `use utf8`. Your regex is exactly like mine, no? Or am I missing something? Weird; anyway, thanks!
Pedro Silva
It's not strange. If you don't use UTF-8 you get a bytewise match, but if you do use UTF-8 you get a character match. The problem you have is that you have not made sure the input you read from the file is decoded into characters.
Kinopiko
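To illustrate the bytewise versus character distinction with file input, here is a minimal sketch. It writes a sample line to a temporary file (a stand-in for the real input file, which was not posted), then reads it back once as raw bytes and once through an :encoding(UTF-8) layer. It assumes the file is UTF-8 encoded; the variable names and the shortened single-coordinate pattern are made up for this example.

```perl
use strict;
use warnings;
use utf8;                        # the pattern contains a literal °
use File::Temp qw(tempfile);

# Shortened pattern: a single coordinate is enough to show the effect
my $re = qr{ \d{1,3} \s* ° \s* \d{1,2} \s* ' \s* \d{1,2} \s* " \s* [ENSW] }x;

# Write a sample line as UTF-8 to a temporary file
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out, ':encoding(UTF-8)';
print {$out} qq{28°44'30"N., 33°12'36"E.\n};
close $out;

# Raw read: ° arrives as two bytes (C2 B0), so the character pattern fails
open my $raw, '<:raw', $path or die "open: $!";
my $bytes = <$raw>;
close $raw;

# Decoded read: ° arrives as one character, so the pattern matches
open my $in, '<:encoding(UTF-8)', $path or die "open: $!";
my $chars = <$in>;
close $in;

my $raw_matches     = $bytes =~ $re ? 1 : 0;
my $decoded_matches = $chars =~ $re ? 1 : 0;
print "raw: $raw_matches, decoded: $decoded_matches\n";   # prints "raw: 0, decoded: 1"
```

With the decoding layer in place, use utf8 can stay in the script and the original pattern can be used unchanged.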