views:

190

answers:

5

I'm not sure exactly how to explain this, so I'll just start with an example.

Given the following data:

Apple
Apricot
Blackberry
Blueberry
Cherry
Crabapple
Cranberry
Elderberry
Grapefruit
Grapes
Kiwi
Mulberry
Nectarine
Pawpaw
Peach
Pear
Plum
Raspberry
Rhubarb
Strawberry

I want to generate an index based on the first letter of my data, but I want the letters grouped together.

Here is the frequency of the first letters in the above dataset:

   2 A
   2 B
   3 C
   1 E
   2 G
   1 K
   1 M
   1 N
   4 P
   2 R
   1 S

Since my example data set is small, let's just say that the maximum number to combine the letters together is 3. Using the data above, this is what my index would come out to be:

A B C D-G H-O P Q-Z

Clicking the "D-G" link would show:

Elderberry
Grapefruit
Grapes

In my range listing above, I am covering the full alphabet - I guess that is not completely neccessary - I would be fine with this output as well:

A B C E-G K-N P R-S

Obviously my dataset is not fruit, I will have more data (around 1000-2000 items), and my "maximum per range" will be more than 3.

I am not too worried about lopsided data either - so if I 40% of my data starts with an "S", then S will just have its own link - I don't need to break it down by the second letter in the data.

Since my dataset won't change too often, I would be fine with a static "maximum per range", but it would be nice to have that calculated dynamically too. Also, the dataset will not start with numbers - it is guaranteed to start with a letter from A-Z.

I've started building the algorithm for this, but it keeps getting so messy I start over. I don't know how to search google for this - I'm not sure what this method is called.

Here is what I started with:

#!/usr/bin/perl

use strict;
use warnings;

my $index_frequency = { map { ( $_, 0 ) } ( 'A' .. 'Z' ) };
my $ranges = {};

open( $DATASET, '<', 'mydata' ) || die "Cannot open data file: $!\n";

while ( my $item = <$DATASET> ) {
    chomp($item);
    my $first_letter = uc( substr( $item, 0, 1 ) );
    $index_frequency->{$first_letter}++;
}

foreach my $letter ( sort keys %{$index_frequency} ) {
    if ( $index_frequency->{$letter} ) {

        # build $ranges here
    }
}

My problem is that I keep using a bunch of global variables to keep track of counts and previous letters examined - my code gets very messy very fast.

Can someone give me a step in the right direction? I guess this is more of an algorithm question, so if you don't have a way to do this in Perl, pseudo code would work too, I guess - I can convert it to Perl.

Thanks in advance!

+1  A: 

Try something like that, where frequency is the frequency array you computed at the previous step and threshold_low is the minimal number of entries in a range, and threshold_high is the max. number. This should give harmonious results.

count=0
threshold_low=3
threshold_high=6
inrange=false
frequency['Z'+1]=threshold_high+1
for letter in range('A' to 'Z'):
  count += frequency[letter];
  if (count>=threshold_low or count+frequency[letter+1]>threshold_high):
     if (inrange): print rangeStart+'-'
     print letter+' '
     inrange=false
     count=0
  else:
     if (not inrange) rangeStart=letter
     inrange=true
redtuna
I had something similar to that, yes. My code really started getting messy though, because one range could consist of just one letter. I will try to use your code as a base and see what I come up with. Thanks!
BrianH
For my example data, A should be in a range by itself. With your code, there needs to be another variable to know the previous letter. So A (2) is not greater than the threshold. But A (2) + B (2) is, so I would like A to be in its own range, and then move on to the next. I'm back to a big mess of code again...
BrianH
You're right, this doesn't match the sample output. But why should A be in a range by itself? I thought that each range should contain at least 3 elements? Do you want to be "each range should contain at least 3 elements, except that single-element ranges are allowed if combining them with the next element would go over threshold"?
redtuna
BrianH
OK. Adding a max number of entries per range (threshold_high) should allow for nice-looking ranges. The new rule is: group if count<threshold_min, unless doing so puts you above threshold_max.
redtuna
+1  A: 

Here's my suggestion:

# get the number of instances of each letter
my %count = ();
while (<FILE>)
{
    $count{ uc( substr( $_, 0, 1 ) ) }++;
}

# transform the list of counts into a map of count => letters
my %freq = ();
while (my ($letter, $count) = each %count)
{
    push @{ $freq{ $count } }, $letter;
}

# now print out the list of letters for each count (or do other appropriate
# output)
foreach (sort keys %freq)
{
    my @sorted_letters = sort @{ $freq{$_} };
    print "$_: @sorted_letters\n";
}

Update: I think that I misunderstood your requirements. The following code block does something more like what you want.

my %count = ();
while (<FILE>)
{
    $count{ uc( substr( $_, 0, 1 ) ) }++;
}

# get the maximum frequency
my $max_freq = (sort values %count)[-1];

my $curr_set_count = 0;
my @curr_set = ();
foreach ('A' .. 'Z') {
    push @curr_set, $_;
    $curr_set_count += $count{$_};

    if ($curr_set_count >= $max_freq) {

        # print out the range of the current set, then clear the set
        if (@curr_set > 1)
            print "$curr_set[0] - $curr_set[-1]\n";
        else
            print "$_\n";

        @curr_set = ();
        $curr_set_count = 0;
    }
}

# print any trailing letters from the end of the alphabet
if (@curr_set > 1)
    print "$curr_set[0] - $curr_set[-1]\n";
else
    print "$_\n";
JSBangs
So the output of this is:1: E K M N S2: A B G R3: C4: PI think this is a good start too, but I still need to find a way to combine these (in alphabetical order). Need to think on this one - Thanks!
BrianH
-1 as it doesn't do the page grouping thing OP asked for.
depesz
Updated my solution, since I realized that I misunderstood the requirements.
JSBangs
+5  A: 

Basic approach:

#!/usr/bin/perl -w
use strict;
use autodie;

my $PAGE_SIZE = 3;
my %frequencies;

open my $fh, '<', 'data';
while ( my $l = <$fh> ) {
    next unless $l =~ m{\A([a-z])}i;
    $frequencies{ uc $1 }++;
}
close $fh;

my $current_sum = 0;
my @letters     = ();
my @pages       = ();

for my $letter ( "A" .. "Z" ) {
    my $letter_weigth = ( $frequencies{ $letter } || 0 );

    if ( $letter_weigth + $current_sum > $PAGE_SIZE ) {
        if ( $current_sum ) {
            my $title = $letters[ 0 ];
            $title .= '-' . $letters[ -1 ] if 1 < scalar @letters;
            push @pages, $title;
        }
        $current_sum = $letter_weigth;
        @letters     = ( $letter );
        next;
    }
    push @letters, $letter;
    $current_sum += $letter_weigth;
}
if ( $current_sum ) {
    my $title = $letters[ 0 ];
    $title .= '-' . $letters[ -1 ] if 1 < scalar @letters;
    push @pages, $title;
}

print "Pages : " . join( " , ", @pages ) . "\n";

Problem with it is that it outputs (from your data):

Pages : A , B , C-D , E-J , K-O , P , Q-Z

But I would argue this is actually good approach :) And you can always change the for loop into:

for my $letter ( sort keys %frequencies ) {

if you need.

depesz
Nope - I mentioned that either way ( A..Z or keys %frequencies ) would be fine with me...I ran this, and it seems to work just as I need it - the code is pretty clean too. This looks great! Thanks very much!
BrianH
Yep - this is great, and pretty simplistic too - thank you very much!My next step will be to automatically calculate the $PAGE_SIZE. I thought about taking my total count from the file divided by 26, but that could be very lopsided. I'm also thinking about averaging the frequency values. I'll play with it.But thank you again - this is great!
BrianH
A: 

This is an example of how I would write this program.

#! /opt/perl/bin/perl
use strict;
use warnings;

my %frequency;
{
  use autodie;
  open my $data_file, '<', 'datafile';

  while( my $line = <$data_file> ){
    my $first_letter = uc( substr( $line, 0, 1 ) );
    $frequency{$first_letter} ++
  }
  # $data_file is automatically closed here
}
#use Util::Any qw'sum';
use List::Util qw'sum';

# This is just an example of how to calculate a threshold
my $mean = sum( values %frequency ) / scalar values %frequency;
my $threshold = $mean * 2;

my @index;
my @group;
for my $letter ( sort keys %frequency ){
  my $frequency = $frequency{$letter};

  if( $frequency >= $threshold ){
    if( @group ){
      if( @group == 1 ){
        push @index, @group;
      }else{
        # push @index, [@group]; # copy @group
        push @index, "$group[0]-$group[-1]";
      }
      @group = ();
    }
    push @index, $letter;
  }elsif( sum( @frequency{@group,$letter} ) >= $threshold ){
    if( @group == 1 ){
      push @index, @group;
    }else{
      #push @index, [@group];
      push @index, "$group[0]-$group[-1]"
    }
    @group = ($letter);
  }else{
    push @group, $letter;
  }
}
#push @index, [@group] if @group;
push @index, "$group[0]-$group[-1]" if @group;

print join( ', ', @index ), "\n";
Brad Gilbert
+1  A: 
use strict;
use warnings;
use List::Util qw(sum);

my @letters = ('A' .. 'Z');
my @raw_data = qw(
    Apple Apricot Blackberry Blueberry Cherry Crabapple Cranberry
    Elderberry Grapefruit Grapes Kiwi Mulberry Nectarine
    Pawpaw Peach Pear Plum Raspberry Rhubarb Strawberry
);

# Store the data by starting letter.
my %data;
push @{$data{ substr $_, 0, 1 }}, $_ for @raw_data;

# Set max page size dynamically, based on the average
# letter-group size (in this case, a multiple of it).
my $MAX_SIZE = sum(map { scalar @$_ } values %data) / keys %data;
$MAX_SIZE = int(1.5 * $MAX_SIZE + .5);

# Organize the data into pages. Each page is an array reference,
# with the first element being the letter range.
my @pages = (['']);
for my $letter (@letters){
    my @d = exists $data{$letter} ? @{$data{$letter}} : ();
    if (@{$pages[-1]} - 1 < $MAX_SIZE or @d == 0){
        push @{$pages[-1]}, @d;
        $pages[-1][0] .= $letter;
    }
    else {
        push @pages, [ $letter, @d ];
    }
}
$_->[0] =~ s/^(.).*(.)$/$1-$2/ for @pages; # Convert letters to range.
FM