




Say, I have a file that has the following lines with a "TIMESTAMP" "NAME":

10:00:00 Bob
11:00:00 Tom
11:00:20 Fred
11:00:40 George
12:00:00 Bill

I want to read this file, group the names that occur in each hour on a single line, then write the revised lines to a file. For example:

10:00:00 Bob
11:00:00 Tom, Fred, George
12:00:00 Bill

Read the file line by line in a block like this:

while (<>) {
    # ... do something with the line in $_
    # specifically, collect the hour and name,
    # ignoring malformed lines
    if (/(\d\d):\d\d:\d\d\s+(\w+)/) {
        my $hour = $1;
        my $name = $2;
        # (the hash update goes here)
    }
}

and build a hash keyed on the hour by inserting the following in the inner if block:

# %people must be declared before the loop (my %people;)
$people{$hour} = defined $people{$hour} ? "$people{$hour}, $name" : $name;

Finally, outside the loop, print the hash:

# sort the keys, since hash order is not guaranteed
for my $time (sort keys %people) {
    print $time . ":00:00 " . $people{$time} . "\n";
}

(This is untested, but this is the basic approach I would take.)
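
Putting the fragments above together, a minimal runnable sketch might look like the following. The helper name group_names is hypothetical (it is not from the answer), and the sketch takes its lines from a list rather than from <> so it can be run as-is.

```perl
use strict;
use warnings;

# Hypothetical helper: group names by hour.
# %people maps "HH" to a comma-separated string of names.
sub group_names {
    my (@lines) = @_;
    my %people;
    for (@lines) {
        # ignore malformed lines
        next unless /(\d\d):\d\d:\d\d\s+(\w+)/;
        my ($hour, $name) = ($1, $2);
        # start the string on the first name, append ", name" afterwards
        $people{$hour} = defined $people{$hour} ? "$people{$hour}, $name" : $name;
    }
    # sort the hours, because hash order is not guaranteed
    return map { "$_:00:00 $people{$_}" } sort keys %people;
}

print "$_\n" for group_names(
    "10:00:00 Bob", "11:00:00 Tom", "11:00:20 Fred",
    "11:00:40 George", "12:00:00 Bill",
);
```

Note that this keeps everything in memory, so it only suits inputs that fit comfortably in a hash.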

Andrew Walker
The double quotes around the hash keys are unnecessary.
So they are, thanks. Removed.
Andrew Walker

Here's a full solution:

my @readings = (
    "10:00:00 Bob",
    "11:00:00 Tom",
    "11:00:20 Fred",
    "11:00:40 George",
    "12:00:00 Bill",
);

my %hours;

for my $line (@readings) {
    # skip lines that don't match, so stale $1/$2 are never used
    next unless $line =~ /^(\d{2}).*?([a-zA-Z]+)/;
    push(@{$hours{$1}}, $2);
}

for my $hour (sort keys %hours) {
    print "$hour:00:00 ";
    print join ", ", @{$hours{$hour}};
    print "\n";
}

This results in:

10:00:00 Bob
11:00:00 Tom, Fred, George
12:00:00 Bill
In grouped_by_hour below, for each line from the filehandle, if it has a timestamp and a name, we push that name onto an array associated with the timestamp's hour, using sprintf to normalize the hour in case one timestamp is 03:04:05 and another is 3:9:18.

sub grouped_by_hour {
  my($fh) = @_;

  local $_;
  my %hour_names;

  while (<$fh>) {
    push @{ $hour_names{sprintf "%02d", $1} } => $2
      if /^(\d+):\d+:\d+\s+(.+?)\s*$/;
  }

  wantarray ? %hour_names : \%hour_names;
}

The normalized hours also allow us to sort with the default comparison. The code below places the input in the special DATA filehandle by having it after the __DATA__ token, but in real code, you might call grouped_by_hour $fh.

my %hour_names = grouped_by_hour \*DATA;
foreach my $hour (sort keys %hour_names) {
  print "$hour:00:00 ", join(", " => @{ $hour_names{$hour} }), "\n";
}

__DATA__
10:00:00 Bob
11:00:00 Tom
11:00:20 Fred
11:00:40 George
12:00:00 Bill

Output:

10:00:00 Bob
11:00:00 Tom, Fred, George
12:00:00 Bill
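
To illustrate why the hour is normalized with sprintf, here is a small sketch. The helper name normalize_hour is hypothetical, introduced only for this example. Zero-padding makes 3 and 03 land on the same hash key, and it makes the default string sort agree with numeric order.

```perl
use strict;
use warnings;

# Hypothetical helper: zero-pad an hour so "3" and "03" share one key.
sub normalize_hour {
    my ($hour) = @_;
    return sprintf "%02d", $hour;
}

my %seen;
$seen{ normalize_hour($_) } = 1 for qw(11 3 03 9);

# Default sort compares strings; with padding that matches numeric order.
my @sorted = sort keys %seen;
print "@sorted\n";   # prints "03 09 11"
```

Without the padding, the keys "3" and "03" would be distinct, and a string sort would put "11" before "3".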
Greg Bacon
It's funny how that fat comma makes it look like the array is being pushed into the scalar...
Given that, per comments on the original question, all entries for the same hour are contiguous and the file is too large to fit into memory, I would dispense with the hash entirely - if the raw file is too big to fit in memory, then a hash containing all of its data will likely also be too large. (Yes, it's compressing the data a bit, but the hash itself adds substantial overhead.)

My solution, then:

#!/usr/bin/env perl

use strict;
use warnings;

my $current_hour = -1;
my @names;

while (my $line = <DATA>) {
  my ($hour, $name) = $line =~ /(\d{2}):\d{2}:\d{2} (.*)/;
  next unless defined $hour;  # skip malformed lines

  # a new hour has started: flush the names collected so far
  if ($hour != $current_hour) {
    print_hour($current_hour, @names);
    @names = ();
    $current_hour = $hour;
  }

  push @names, $name;
}

# flush the final hour
print_hour($current_hour, @names);

sub print_hour {
  my ($hour, @names) = @_;
  return unless @names;

  print $hour, ':00:00 ', (join ', ', @names), "\n";
}

__DATA__
10:00:00 Bob
11:00:00 Tom
11:00:20 Fred
11:00:40 George
12:00:00 Bill
Dave Sherohman