Let's say I have an array, @theArr, which holds 1,000 or so elements such as the following:

01  '12 16 sj.1012804p1012831.93.gz'
02  '12 16 sj.1012832p1012859.94.gz'
03  '12 16 sj.1012860p1012887.95.gz'
04  '12 16 sj.1012888p1012915.96.gz'
05  '12 16 sj.1012916p1012943.97.gz'
06  '12 16 sj.875352p875407.01.gz'
07  '12 16 sj.875408p875435.02.gz'
08  '12 16 sj.875436p875535.03.gz'
09  '12 16 sj.875536p875575.04.gz'
10  '12 16 sj.875576p875603.05.gz'
11  '12 16 sj.875604p875631.06.gz'
12  '12 16 sj.875632p875659.07.gz'
13  '12 16 sj.875660p875687.08.gz'
14  '12 16 sj.875688p875715.09.gz'
15  '12 16 sj.875716p875743.10.gz'
...

If my first set of numbers (between the 'sj.' and the 'p') was always 6 digits, I wouldn't have a problem. But when the numbers roll over into 7 digits, the default sort stops working, because it compares the elements as strings and the larger 7-digit numbers sort before the smaller 6-digit numbers.

Is there a way to tell Perl to sort by that number inside the string in each array element?

+11  A: 

Looks like you need a Schwartzian Transform:

#!/usr/bin/perl

use strict;
use warnings;

my @a = <DATA>;

print 
    map  { $_->[1] }                #get the original value back
    sort { $a->[0] <=> $b->[0] }    #sort arrayrefs numerically on the sort value
    map  { /sj\.(.*?)p/; [$1, $_] } #build arrayref of the sort value and orig
    @a;

__DATA__
12 16 sj.1012804p1012831.93.gz
12 16 sj.1012832p1012859.94.gz
12 16 sj.1012860p1012887.95.gz
12 16 sj.1012888p1012915.96.gz
12 16 sj.1012916p1012943.97.gz
12 16 sj.875352p875407.01.gz
12 16 sj.875408p875435.02.gz
12 16 sj.875436p875535.03.gz
12 16 sj.875536p875575.04.gz
12 16 sj.875576p875603.05.gz
12 16 sj.875604p875631.06.gz
12 16 sj.875632p875659.07.gz
12 16 sj.875660p875687.08.gz
12 16 sj.875688p875715.09.gz
12 16 sj.875716p875743.10.gz
Chas. Owens
Your regex is wrong. The number part stops on the "p", not the ., so your regex should be /sj\.(\d+)p/
Matt Kane
Do not use \d to mean [0-9]. In Perl 5.8 and 5.10 it means any UNICODE character that has the digit property. This means that if you are trying to use \d to mean [0-9] you will also inadvertently match "\x{1815}" (MONGOLIAN DIGIT 5).
Chas. Owens
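A small sketch of the point in the comment above, assuming a Perl where \d is Unicode-aware for character strings (the comment names 5.8 and 5.10). The variable names are made up for illustration. Perl 5.14 and later also offer the /a regex modifier, which restricts \d to ASCII digits.

```perl
use strict;
use warnings;

# U+1815 is MONGOLIAN DIGIT FIVE, the character named in the comment above.
my $mongolian_five = "\x{1815}";

# \d is Unicode-aware for a character string like this one, so it matches.
my $matches_backslash_d = ($mongolian_five =~ /\d/)    ? 1 : 0;

# An explicit ASCII range does not match.
my $matches_ascii_range = ($mongolian_five =~ /[0-9]/) ? 1 : 0;

print "\\d matches U+1815:    $matches_backslash_d\n";
print "[0-9] matches U+1815: $matches_ascii_range\n";
```

If the sort key must be something Perl can do arithmetic on, [0-9] states that intent directly.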
@Chas: I did not know that about \d. (That must be what's causing my bugs -- **all those** MONGOLIAN DIGIT 5s out there... ;))
j_random_hacker
One thing though: depending on the data, this seems to sort it in descending order or sometimes in ascending order. Any reason why?
Nick
nevermind, I had data that didn't match the regexp in the second map
Nick
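The ordering problem Nick describes can be avoided by making unmatched lines explicit. In the transform above, a failed match does not reset $1, so the sort key for such a line can end up undefined or left over from an earlier match. The sketch below is one possible variation, not the answer's own code: it captures in list context so a failed match yields undef, then drops those lines before sorting. The 'README.txt' entry is a hypothetical non-matching element added for illustration.

```perl
use strict;
use warnings;

my @lines = (
    '12 16 sj.1012804p1012831.93.gz',
    'README.txt',                       # hypothetical entry with no sj number
    '12 16 sj.875352p875407.01.gz',
    '12 16 sj.875408p875435.02.gz',
);

my @sorted =
    map  { $_->[1] }                               # get the original value back
    sort { $a->[0] <=> $b->[0] }                   # numeric sort on the extracted key
    grep { defined $_->[0] }                       # drop lines that did not match
    map  { my ($n) = /sj\.([0-9]+)p/; [$n, $_] }   # $n is undef when there is no match
    @lines;

print "$_\n" for @sorted;
```

Whether to drop, die on, or sort unmatched lines to one end depends on the data; dropping them is just the simplest choice for a sketch.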
+1  A: 

Yes. The sort function takes an optional comparison function which will be used to compare two elements. It can take the form of either a block of code, or the name of a function to call.

There is an example at the linked document that is similar to what you want to do:

# inefficiently sort by descending numeric compare using
# the first integer after the first = sign, or the
# whole record case-insensitively otherwise

@new = sort {
    ($b =~ /=(\d+)/)[0] <=> ($a =~ /=(\d+)/)[0]
        ||
    uc($a) cmp uc($b)
} @old;
RBerteig
\d does not mean what you think it means. In Perl 5.8 and 5.10 it means any UNICODE character that has the digit property. This means that if you are trying to use \d to mean [0-9] you will also inadvertently match "\x{1815}" (MONGOLIAN DIGIT 5).
Chas. Owens
+1, but Chas. Owens' solution is likely to be quite a bit faster as regex matching is only performed once.
j_random_hacker
So that's three (maybe four times) in _one_ thread that we have heard about the dreaded 'MONGOLIAN DIGIT' problem. I'm genuinely curious: did you have a really bad case of Mongolian data flu at some point?
Telemachus
No, just trying to make sure people get the news to stop using \d (at least in Perl 5.8 and 5.10). And maybe if enough people find out, there will be enough pressure to get it fixed in 5.12. U+1815 is just a handy you-will-never-want-to-match-this character.
Chas. Owens
@Chas - Do you have a link to a more complete description of the problem? I agree that "\d" is wrong if you want to strictly match only ASCII 0-9 but shouldn't it match things like U+1815 if you're parsing Unicode data? Furthermore, it's perfectly safe to use it to mean "[0-9]" if you know that your data is ASCII (which is likely).
Michael Carman
@ Michael Carman - The problem is "knowing" your data is ASCII. We are increasingly moving into a world where UTF-8 is the default character encoding. Any code you write today that assumes it is working on ASCII will break tomorrow. As for matching any digit characters, there is always \p{N}, \p{Nd}, \p{Nl}, \p{No}, which are much better since they state explicitly what type of digit you are looking for. Until "\x{1815}" + 1 is 6, \d should mean [0-9] because people use \d to mean "numbers I can do math with".
Chas. Owens
About \d matching more than one kind of digit. Fine. But I suspect that the (very) bright minds shepherding Perl have given that a lot of thought, and have decided that a digit is a digit. If you can't do math with them, then that is a bug that must be handled at the conversion from digits to numbers. If the problem is that UNICODE knows about more than one kind of digit at all, then it has to be taken up with them. As it is, I think it is and has to be reasonable to accept digits as digits. (ditto for all the new kinds of spaces...)
RBerteig
@j_random_hacker, I never said it was fast. In fact, the page linked has a faster example immediately following. I was trying to gently nudge the question over to the documentation where it is actually answered already.
RBerteig
All that said, the Schwartzian Transform shown by Chas. Owens is likely to be the fastest solution because it extracts the field on which to sort and converts it to a suitable type exactly once per original record.
RBerteig
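To make the "exactly once per original record" point concrete, here is a rough sketch that counts how often the key is extracted under each approach. The $sj_key helper and the $calls counter are hypothetical names introduced only for this illustration, and the exact comparator count depends on Perl's sort implementation and the input order.

```perl
use strict;
use warnings;

my @theArr = (
    '12 16 sj.1012804p1012831.93.gz',
    '12 16 sj.1012832p1012859.94.gz',
    '12 16 sj.875352p875407.01.gz',
    '12 16 sj.875408p875435.02.gz',
    '12 16 sj.875436p875535.03.gz',
    '12 16 sj.875536p875575.04.gz',
);

my $calls = 0;                       # how many times the regex has run
my $sj_key = sub {                   # hypothetical helper: pull out the sort key
    my ($line) = @_;
    $calls++;
    my ($n) = $line =~ /sj\.([0-9]+)p/;
    return $n;
};

# Extraction inside the comparator: two regex matches per comparison.
$calls = 0;
my @naive = sort { $sj_key->($a) <=> $sj_key->($b) } @theArr;
my $naive_calls = $calls;

# Schwartzian Transform: one regex match per element, before sorting.
$calls = 0;
my @st =
    map  { $_->[1] }
    sort { $a->[0] <=> $b->[0] }
    map  { [ $sj_key->($_), $_ ] }
    @theArr;
my $st_calls = $calls;

print "extractions in comparator: $naive_calls\n";
print "extractions in transform:  $st_calls\n";
```

Both approaches produce the same order; the difference is only in how much work is repeated, which matters more as the array grows.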
@RBerteig: Sure, I was just pointing out the fact as an aside. I happen to think your algorithm is much easier to follow (maybe because that's how I usually code it up... ;))
j_random_hacker
To those who are tired of hearing about MONGOLIAN DIGIT 5: It's weird artefacts like this that will be the source of huge numbers of bugs (and exploits) in the next few years. I'm thinking about the **days** of future debugging time I'm gonna save by hearing about this now.
j_random_hacker
+2  A: 

You can use a regex to pull the number out of every line inside the block you pass to the sort function:

@newArray = sort {
    my ($anum, $bnum);
    $a =~ /sj\.([0-9]+)p/; $anum = $1;
    $b =~ /sj\.(\d+)p/;    $bnum = $1;
    $anum <=> $bnum
} @theArr;

However, Chas. Owens's solution is better, since it only does the regex matches once for every element.
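Another way to get the same "match once per element" behaviour while keeping a plain sort block is to precompute the keys into a hash first. This is a different technique from the Schwartzian Transform, sketched here under the assumption that every element matches the pattern; %sj_key is a hypothetical name.

```perl
use strict;
use warnings;

my @theArr = (
    '12 16 sj.1012804p1012831.93.gz',
    '12 16 sj.1012832p1012859.94.gz',
    '12 16 sj.875352p875407.01.gz',
    '12 16 sj.875408p875435.02.gz',
);

# Run the regex once per element and remember the result.
# Assumes every element contains an "sj.<digits>p" part.
my %sj_key;
for my $line (@theArr) {
    my ($n) = $line =~ /sj\.([0-9]+)p/;
    $sj_key{$line} = $n;
}

# The comparator is now just two hash lookups and a numeric compare.
my @newArray = sort { $sj_key{$a} <=> $sj_key{$b} } @theArr;

print "$_\n" for @newArray;
```

The trade-off is an extra hash held in memory for the duration of the sort.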

Matt Kane
\d does not mean what you think it means. In Perl 5.8 and 5.10 it means any UNICODE character that has the digit property. This means that if you are trying to use \d to mean [0-9] you will also inadvertently match "\x{1815}" (MONGOLIAN DIGIT 5).
Chas. Owens
+1  A: 

Here's an example that sorts them ascending, assuming you don't care too much about efficiency:

use strict;

my @theArr = split(/\n/, <<END_SAMPLE);
12 16 sj.1012804p1012831.93.gz
12 16 sj.1012832p1012859.94.gz
12 16 sj.1012860p1012887.95.gz
12 16 sj.1012888p1012915.96.gz
12 16 sj.1012916p1012943.97.gz
12 16 sj.875352p875407.01.gz
12 16 sj.875408p875435.02.gz
12 16 sj.875436p875535.03.gz
12 16 sj.875536p875575.04.gz
12 16 sj.875576p875603.05.gz
END_SAMPLE

my @sortedArr = sort compareBySJ @theArr;

print "Before:\n".join("\n", @theArr)."\n";
print "After:\n".join("\n", @sortedArr)."\n";

sub compareBySJ {
    # Capture the values to compare, against the expected format
    # NOTE: This could be inefficient for large, unsorted arrays
    #       since you'll be matching the same strings repeatedly
    my ($aVal) = $a =~ /^\d+\s+\d+\s+sj\.(\d+)p/
        or die "Couldn't match against value $a";
    my ($bVal) = $b =~ /^\d+\s+\d+\s+sj\.(\d+)p/
        or die "Couldn't match against value $b";

    # Return the numerical comparison of the values (ascending order)
    return $aVal <=> $bVal;
}

Outputs:

Before:
12 16 sj.1012804p1012831.93.gz
12 16 sj.1012832p1012859.94.gz
12 16 sj.1012860p1012887.95.gz
12 16 sj.1012888p1012915.96.gz
12 16 sj.1012916p1012943.97.gz
12 16 sj.875352p875407.01.gz
12 16 sj.875408p875435.02.gz
12 16 sj.875436p875535.03.gz
12 16 sj.875536p875575.04.gz
12 16 sj.875576p875603.05.gz
After:
12 16 sj.875352p875407.01.gz
12 16 sj.875408p875435.02.gz
12 16 sj.875436p875535.03.gz
12 16 sj.875536p875575.04.gz
12 16 sj.875576p875603.05.gz
12 16 sj.1012804p1012831.93.gz
12 16 sj.1012832p1012859.94.gz
12 16 sj.1012860p1012887.95.gz
12 16 sj.1012888p1012915.96.gz
12 16 sj.1012916p1012943.97.gz
Plate
I think your before print is in the wrong place.
Chas. Owens
\d does not mean what you think it means. In Perl 5.8 and 5.10 it means any UNICODE character that has the digit property. This means that if you are trying to use \d to mean [0-9] you will also inadvertently match "\x{1815}" (MONGOLIAN DIGIT 5).
Chas. Owens
Thanks, and point taken on the \d.
Plate
+1  A: 

Haha, Mongolian Digits. \d means exactly what he/she thought it meant: "digit character".