ansaurus

Question

Using Perl, how can I sort an array using the value of a number inside each array element?

Answer 1

+11 A:

Looks like you need a Schwartzian Transform:

#!/usr/bin/perl

use strict;
use warnings;

my @a = <DATA>;

print 
    map  { $_->[1] }                #get the original value back
    sort { $a->[0] <=> $b->[0] }    #sort arrayrefs numerically on the sort value
    map  { /sj\.(.*?)p/; [$1, $_] } #build arrayref of the sort value and orig
    @a;

__DATA__
12 16 sj.1012804p1012831.93.gz
12 16 sj.1012832p1012859.94.gz
12 16 sj.1012860p1012887.95.gz
12 16 sj.1012888p1012915.96.gz
12 16 sj.1012916p1012943.97.gz
12 16 sj.875352p875407.01.gz
12 16 sj.875408p875435.02.gz
12 16 sj.875436p875535.03.gz
12 16 sj.875536p875575.04.gz
12 16 sj.875576p875603.05.gz
12 16 sj.875604p875631.06.gz
12 16 sj.875632p875659.07.gz
12 16 sj.875660p875687.08.gz
12 16 sj.875688p875715.09.gz
12 16 sj.875716p875743.10.gz

Chas. Owens 2009-05-01 01:06:57

Your regex is wrong. The number part stops on the "p", not the ., so your regex should be /sj\.(\d+)p/

Matt Kane 2009-05-01 01:09:46

Do not use \d to mean [0-9]. In Perl 5.8 and 5.10 it means any UNICODE character that has the digit property. This means that if you are trying to use \d to mean [0-9] you will also inadvertently match "\x{1815}" (MONGOLIAN DIGIT 5).

Chas. Owens 2009-05-01 01:14:17
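
A minimal sketch of the claim above, assuming a Perl where \d follows Unicode rules for characters above U+00FF (as described for 5.8 and 5.10). The variable names are made up for illustration. It checks the same one-character string against \d and against [0-9]:

```perl
use strict;
use warnings;

# "\x{1815}" is MONGOLIAN DIGIT FIVE, the character named in the comment.
my $mongolian_five = "\x{1815}";

# \d matches any character with the Unicode digit property.
my $matches_backslash_d = $mongolian_five =~ /\A\d\z/    ? 1 : 0;

# [0-9] matches only the ten ASCII digits.
my $matches_ascii_class = $mongolian_five =~ /\A[0-9]\z/ ? 1 : 0;

print "\\d matches U+1815:    $matches_backslash_d\n";
print "[0-9] matches U+1815: $matches_ascii_class\n";
```

Perl 5.14 and later also offer the /a regex modifier, which restricts \d to ASCII digits, but [0-9] works on every version mentioned in this thread.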

@Chas: I did not know that about \d. (That must be what's causing my bugs -- **all those** MONGOLIAN DIGIT 5s out there... ;))

j_random_hacker 2009-05-01 04:45:30

One thing though: depending on the data, this seems to sort it in descending order or sometimes in ascending order. Any reason why?

Nick 2009-05-01 08:54:25

Never mind, I had data that didn't match the regexp in the second map.

Nick 2009-05-01 08:58:27
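
The problem Nick ran into can be avoided by checking the match inside the first map. When the regex fails, $1 does not hold a key taken from that line, so the sort key is unreliable. The following is only a sketch under assumptions: the sample lines are invented, and silently dropping non-matching lines is one possible policy (dying or warning would be another).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input; the second line has no sj.<number>p field.
my @lines = (
    "12 16 sj.1012804p1012831.93.gz\n",
    "this line has no sort key\n",
    "12 16 sj.875352p875407.01.gz\n",
);

my @sorted =
    map  { $_->[1] }                          # recover the original line
    sort { $a->[0] <=> $b->[0] }              # numeric sort on the extracted key
    map  { /sj\.([0-9]+)p/ ? [$1, $_] : () }  # skip lines that did not match
    @lines;

print @sorted;
```

Returning the empty list from the map block removes the element, so only lines with a usable key reach the sort.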

Answer 2

+1 A:

Yes. The sort function takes an optional comparison function which will be used to compare two elements. It can take the form of either a block of code, or the name of a function to call.

There is an example at the linked document that is similar to what you want to do:

# inefficiently sort by descending numeric compare using
# the first integer after the first = sign, or the
# whole record case-insensitively otherwise

@new = sort {
    ($b =~ /=(\d+)/)[0] <=> ($a =~ /=(\d+)/)[0]
                        ||
                uc($a)  cmp  uc($b)
} @old;

RBerteig 2009-05-01 01:08:18

\d does not mean what you think it means. In Perl 5.8 and 5.10 it means any UNICODE character that has the digit property. This means that if you are trying to use \d to mean [0-9] you will also inadvertently match "\x{1815}" (MONGOLIAN DIGIT 5).

Chas. Owens 2009-05-01 01:18:33

+1, but Chas. Owens' solution is likely to be quite a bit faster as regex matching is only performed once.

j_random_hacker 2009-05-01 04:50:22

So that's three (maybe four) times in _one_ thread that we have heard about the dreaded 'MONGOLIAN DIGIT' problem. I'm genuinely curious: did you have a really bad case of Mongolian data flu at some point?

Telemachus 2009-05-01 11:30:19

No, just trying to make sure people get the news to stop using \d (at least in Perl 5.8 and 5.10). And maybe if enough people find out, there will be enough pressure to get it fixed in 5.12. U+1815 is just a handy you-will-never-want-to-match-this character.

Chas. Owens 2009-05-01 12:18:41

@Chas - Do you have a link to a more complete description of the problem? I agree that "\d" is wrong if you want to strictly match only ASCII 0-9 but shouldn't it match things like U+1815 if you're parsing Unicode data? Furthermore, it's perfectly safe to use it to mean "[0-9]" if you know that your data is ASCII (which is likely).

Michael Carman 2009-05-01 13:24:56

@Michael Carman - The problem is "knowing" your data is ASCII. We are increasingly moving into a world where UTF-8 is the default character encoding. Any code you write today that assumes it is working on ASCII will break tomorrow. As for matching any digit characters, there is always \p{N}, \p{Nd}, \p{Nl}, \p{No}, which are much better since they state explicitly what type of digit you are looking for. Until "\x{1815}" + 1 is 6, \d should mean [0-9] because people use \d to mean "numbers I can do math with".

Chas. Owens 2009-05-01 15:05:09

About \d matching more than one kind of digit. Fine. But I suspect that the (very) bright minds shepherding Perl have given that a lot of thought, and have decided that a digit is a digit. If you can't do math with them, then that is a bug that must be handled at the conversion from digits to numbers. If the problem is that UNICODE knows about more than one kind of digit at all, then it has to be taken up with them. As it is, I think it is and has to be reasonable to accept digits as digits. (ditto for all the new kinds of spaces...)

RBerteig 2009-05-01 17:57:50

@j_random_hacker, I never said it was fast. In fact, the page linked has a faster example immediately following. I was trying to gently nudge the question over to the documentation where it is actually answered already.

RBerteig 2009-05-01 17:59:09

All that said, the Schwartzian Transform shown by Chas. Owens is likely to be the fastest solution because it extracts the field on which to sort and converts it to a suitable type exactly once per original record.

RBerteig 2009-05-01 20:24:18
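
For comparison, here is a sketch of another way to extract each key only once: precompute the keys into a parallel array and sort the indices. It is written in the spirit of the faster documentation example mentioned above, not copied from it. The sample records and the fallback key of 0 are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical records in the key=value style of the documentation example.
my @old = ("c=7", "a=30", "b=12");

# Run the regex exactly once per record, storing keys by record position.
my @keys;
for my $rec (@old) {
    my ($n) = $rec =~ /=([0-9]+)/;
    push @keys, defined $n ? $n : 0;   # assumed fallback for records with no number
}

# Sort the indices by descending key, then slice the records in that order.
my @new = @old[ sort { $keys[$b] <=> $keys[$a] } 0 .. $#old ];

print "$_\n" for @new;
```

Like the Schwartzian Transform, this avoids re-running the regex inside the comparison block; it trades the nested map/sort/map for an extra array.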

@RBerteig: Sure, I was just pointing out the fact as an aside. I happen to think your algorithm is much easier to follow (maybe because that's how I usually code it up... ;))

j_random_hacker 2009-05-02 15:38:33

To those who are tired of hearing about MONGOLIAN DIGIT 5: It's weird artefacts like this that will be the source of huge numbers of bugs (and exploits) in the next few years. I'm thinking about the **days** of future debugging time I'm gonna save by hearing about this now.

j_random_hacker 2009-05-02 15:45:19

Answer 3

+2 A:

You can use a regex to pull the number out of every line inside the block you pass to the sort function:

@newArray = sort {
    my ($anum, $bnum);
    $a =~ /sj\.([0-9]+)p/; $anum = $1;
    $b =~ /sj\.(\d+)p/;    $bnum = $1;
    $anum <=> $bnum
} @theArr;

However, Chas. Owens's solution is better, since it only does the regex matches once for every element.

Matt Kane 2009-05-01 01:11:47

\d does not mean what you think it means. In Perl 5.8 and 5.10 it means any UNICODE character that has the digit property. This means that if you are trying to use \d to mean [0-9] you will also inadvertently match "\x{1815}" (MONGOLIAN DIGIT 5).

Chas. Owens 2009-05-01 01:18:43

Answer 4

+1 A:

Here's an example that sorts them ascending, assuming you don't care too much about efficiency:

use strict;

my @theArr = split(/\n/, <<END_SAMPLE);
12 16 sj.1012804p1012831.93.gz
12 16 sj.1012832p1012859.94.gz
12 16 sj.1012860p1012887.95.gz
12 16 sj.1012888p1012915.96.gz
12 16 sj.1012916p1012943.97.gz
12 16 sj.875352p875407.01.gz
12 16 sj.875408p875435.02.gz
12 16 sj.875436p875535.03.gz
12 16 sj.875536p875575.04.gz
12 16 sj.875576p875603.05.gz
END_SAMPLE

my @sortedArr = sort compareBySJ @theArr;

print "Before:\n".join("\n", @theArr)."\n";
print "After:\n".join("\n", @sortedArr)."\n";

sub compareBySJ {
    # Capture the values to compare, against the expected format
    # NOTE: This could be inefficient for large, unsorted arrays
    #       since you'll be matching the same strings repeatedly
    my ($aVal) = $a =~ /^\d+\s+\d+\s+sj\.(\d+)p/
        or die "Couldn't match against value $a";
    my ($bVal) = $b =~ /^\d+\s+\d+\s+sj\.(\d+)p/
        or die "Couldn't match against value $b";

    # Return the numerical comparison of the values (ascending order)
    return $aVal <=> $bVal;
}

Outputs:

Before:
12 16 sj.1012804p1012831.93.gz
12 16 sj.1012832p1012859.94.gz
12 16 sj.1012860p1012887.95.gz
12 16 sj.1012888p1012915.96.gz
12 16 sj.1012916p1012943.97.gz
12 16 sj.875352p875407.01.gz
12 16 sj.875408p875435.02.gz
12 16 sj.875436p875535.03.gz
12 16 sj.875536p875575.04.gz
12 16 sj.875576p875603.05.gz
After:
12 16 sj.875352p875407.01.gz
12 16 sj.875408p875435.02.gz
12 16 sj.875436p875535.03.gz
12 16 sj.875536p875575.04.gz
12 16 sj.875576p875603.05.gz
12 16 sj.1012804p1012831.93.gz
12 16 sj.1012832p1012859.94.gz
12 16 sj.1012860p1012887.95.gz
12 16 sj.1012888p1012915.96.gz
12 16 sj.1012916p1012943.97.gz

Plate 2009-05-01 01:12:43

I think your before print is in the wrong place.

Chas. Owens 2009-05-01 01:17:20

\d does not mean what you think it means. In Perl 5.8 and 5.10 it means any UNICODE character that has the digit property. This means that if you are trying to use \d to mean [0-9] you will also inadvertently match "\x{1815}" (MONGOLIAN DIGIT 5).

Chas. Owens 2009-05-01 01:18:52

Thanks, and point taken on the \d.

Plate 2009-05-01 01:40:17

Answer 5

+1 A:

Haha, Mongolian Digits. \d means exactly what he/she thought it meant: "digit character".

2009-05-01 07:11:17
