ansaurus

Question

How can I generate a set of ranges from the first letters of a list of words in Perl?

Answer 1

+1 A:

Try something like that, where frequency is the frequency array you computed at the previous step and threshold_low is the minimal number of entries in a range, and threshold_high is the max. number. This should give harmonious results.

count=0
threshold_low=3
threshold_high=6
inrange=false
frequency['Z'+1]=threshold_high+1
for letter in range('A' to 'Z'):
  count += frequency[letter];
  if (count>=threshold_low or count+frequency[letter+1]>threshold_high):
     if (inrange): print rangeStart+'-'
     print letter+' '
     inrange=false
     count=0
  else:
     if (not inrange) rangeStart=letter
     inrange=true

redtuna 2009-09-02 15:33:01

I had something similar to that, yes. My code really started getting messy though, because one range could consist of just one letter. I will try to use your code as a base and see what I come up with. Thanks!

BrianH 2009-09-02 15:41:25

For my example data, A should be in a range by itself. With your code, there needs to be another variable to know the previous letter. So A (2) is not greater than the threshold. But A (2) + B (2) is, so I would like A to be in its own range, and then move on to the next. I'm back to a big mess of code again...

BrianH 2009-09-02 15:56:22

You're right, this doesn't match the sample output. But why should A be in a range by itself? I thought that each range should contain at least 3 elements? Do you want to be "each range should contain at least 3 elements, except that single-element ranges are allowed if combining them with the next element would go over threshold"?

redtuna 2009-09-02 16:17:04

BrianH 2009-09-02 16:48:26

OK. Adding a max number of entries per range (threshold_high) should allow for nice-looking ranges. The new rule is: group if count<threshold_min, unless doing so puts you above threshold_max.

redtuna 2009-09-02 17:58:49

Answer 2

+1 A:

Here's my suggestion:

# get the number of instances of each letter
my %count = ();
while (<FILE>)
{
    $count{ uc( substr( $_, 0, 1 ) ) }++;
}

# transform the list of counts into a map of count => letters
my %freq = ();
while (my ($letter, $count) = each %count)
{
    push @{ $freq{ $count } }, $letter;
}

# now print out the list of letters for each count (or do other appropriate
# output)
foreach (sort keys %freq)
{
    my @sorted_letters = sort @{ $freq{$_} };
    print "$_: @sorted_letters\n";
}

Update: I think that I misunderstood your requirements. The following code block does something more like what you want.

my %count = ();
while (<FILE>)
{
    $count{ uc( substr( $_, 0, 1 ) ) }++;
}

# get the maximum frequency
my $max_freq = (sort values %count)[-1];

my $curr_set_count = 0;
my @curr_set = ();
foreach ('A' .. 'Z') {
    push @curr_set, $_;
    $curr_set_count += $count{$_};

    if ($curr_set_count >= $max_freq) {

        # print out the range of the current set, then clear the set
        if (@curr_set > 1)
            print "$curr_set[0] - $curr_set[-1]\n";
        else
            print "$_\n";

        @curr_set = ();
        $curr_set_count = 0;
    }
}

# print any trailing letters from the end of the alphabet
if (@curr_set > 1)
    print "$curr_set[0] - $curr_set[-1]\n";
else
    print "$_\n";

JSBangs 2009-09-02 15:53:35

So the output of this is:1: E K M N S2: A B G R3: C4: PI think this is a good start too, but I still need to find a way to combine these (in alphabetical order). Need to think on this one - Thanks!

BrianH 2009-09-02 16:00:17

-1 as it doesn't do the page grouping thing OP asked for.

depesz 2009-09-02 16:00:43

Updated my solution, since I realized that I misunderstood the requirements.

JSBangs 2009-09-02 16:08:57

Answer 3

+5 A:

Basic approach:

#!/usr/bin/perl -w
use strict;
use autodie;

my $PAGE_SIZE = 3;
my %frequencies;

open my $fh, '<', 'data';
while ( my $l = <$fh> ) {
    next unless $l =~ m{\A([a-z])}i;
    $frequencies{ uc $1 }++;
}
close $fh;

my $current_sum = 0;
my @letters     = ();
my @pages       = ();

for my $letter ( "A" .. "Z" ) {
    my $letter_weigth = ( $frequencies{ $letter } || 0 );

    if ( $letter_weigth + $current_sum > $PAGE_SIZE ) {
        if ( $current_sum ) {
            my $title = $letters[ 0 ];
            $title .= '-' . $letters[ -1 ] if 1 < scalar @letters;
            push @pages, $title;
        }
        $current_sum = $letter_weigth;
        @letters     = ( $letter );
        next;
    }
    push @letters, $letter;
    $current_sum += $letter_weigth;
}
if ( $current_sum ) {
    my $title = $letters[ 0 ];
    $title .= '-' . $letters[ -1 ] if 1 < scalar @letters;
    push @pages, $title;
}

print "Pages : " . join( " , ", @pages ) . "\n";

Problem with it is that it outputs (from your data):

Pages : A , B , C-D , E-J , K-O , P , Q-Z

But I would argue this is actually good approach :) And you can always change the for loop into:

for my $letter ( sort keys %frequencies ) {

if you need.

depesz 2009-09-02 16:00:01

Nope - I mentioned that either way ( A..Z or keys %frequencies ) would be fine with me...I ran this, and it seems to work just as I need it - the code is pretty clean too. This looks great! Thanks very much!

BrianH 2009-09-02 16:04:10

Yep - this is great, and pretty simplistic too - thank you very much!My next step will be to automatically calculate the $PAGE_SIZE. I thought about taking my total count from the file divided by 26, but that could be very lopsided. I'm also thinking about averaging the frequency values. I'll play with it.But thank you again - this is great!

BrianH 2009-09-02 16:14:27

Answer 4

A:

This is an example of how I would write this program.

#! /opt/perl/bin/perl
use strict;
use warnings;

my %frequency;
{
  use autodie;
  open my $data_file, '<', 'datafile';

  while( my $line = <$data_file> ){
    my $first_letter = uc( substr( $line, 0, 1 ) );
    $frequency{$first_letter} ++
  }
  # $data_file is automatically closed here
}
#use Util::Any qw'sum';
use List::Util qw'sum';

# This is just an example of how to calculate a threshold
my $mean = sum( values %frequency ) / scalar values %frequency;
my $threshold = $mean * 2;

my @index;
my @group;
for my $letter ( sort keys %frequency ){
  my $frequency = $frequency{$letter};

  if( $frequency >= $threshold ){
    if( @group ){
      if( @group == 1 ){
        push @index, @group;
      }else{
        # push @index, [@group]; # copy @group
        push @index, "$group[0]-$group[-1]";
      }
      @group = ();
    }
    push @index, $letter;
  }elsif( sum( @frequency{@group,$letter} ) >= $threshold ){
    if( @group == 1 ){
      push @index, @group;
    }else{
      #push @index, [@group];
      push @index, "$group[0]-$group[-1]"
    }
    @group = ($letter);
  }else{
    push @group, $letter;
  }
}
#push @index, [@group] if @group;
push @index, "$group[0]-$group[-1]" if @group;

print join( ', ', @index ), "\n";

Brad Gilbert 2009-09-02 16:48:56

Answer 5

+1 A:

use strict;
use warnings;
use List::Util qw(sum);

my @letters = ('A' .. 'Z');
my @raw_data = qw(
    Apple Apricot Blackberry Blueberry Cherry Crabapple Cranberry
    Elderberry Grapefruit Grapes Kiwi Mulberry Nectarine
    Pawpaw Peach Pear Plum Raspberry Rhubarb Strawberry
);

# Store the data by starting letter.
my %data;
push @{$data{ substr $_, 0, 1 }}, $_ for @raw_data;

# Set max page size dynamically, based on the average
# letter-group size (in this case, a multiple of it).
my $MAX_SIZE = sum(map { scalar @$_ } values %data) / keys %data;
$MAX_SIZE = int(1.5 * $MAX_SIZE + .5);

# Organize the data into pages. Each page is an array reference,
# with the first element being the letter range.
my @pages = (['']);
for my $letter (@letters){
    my @d = exists $data{$letter} ? @{$data{$letter}} : ();
    if (@{$pages[-1]} - 1 < $MAX_SIZE or @d == 0){
        push @{$pages[-1]}, @d;
        $pages[-1][0] .= $letter;
    }
    else {
        push @pages, [ $letter, @d ];
    }
}
$_->[0] =~ s/^(.).*(.)$/$1-$2/ for @pages; # Convert letters to range.

FM 2009-09-02 20:02:52

ansaurus

tags:

views:

answers:

How can I generate a set of ranges from the first letters of a list of words in Perl?

related questions