I'm working on some doc file, that when copied and pasted into a text file, gives me the following sample 'output':

ARTA215   ADVANCED LIFE DRAWING (3 Cr) (2:2)  + Studio 1 hr.
This advanced study in drawing with the life ....
Prerequisite: ARTA150
Lab Fee Required

ARTA220   CERAMICS II  (3 Cr) (2:2)  + Studio 1 hr.
This course affords the student the opportunity to ex...
Lab Fee Required

ARTA250   SPECIAL TOPICS IN ART 
  This course focuses on selected topic....

ARTA260   PORTFOLIO DEVELOPMENT   (3 Cr) (3:0)
The purpose of this course is to pre....
BIOS010   INTRODUCTION TO BIOLOGICAL CONCEPTS (3IC) (2:2) 
This course is a preparatory course designed to familiarize the begi....

BIOS101   GENERAL BIOLOGY (4 Cr) (3:3)
This course introduces the student to the principles of mo...
Lab Fee Required

BIOS102   INTRODUCTION TO HUMAN BIOLOGY  (4 Cr)  (3:3)
This course is an introd....
Lab Fee Required

I want to be able to parse it so that 3 fields are generated and I could output the values into a .csv file.

The line breaks, spacing, etc. shown above are how they can appear at any point in the file.

My best guess is for a regex to find 4 capitalized alpha chars followed by 3 num chars, then find out if the next 2 chars are capitalized. (this accounts for the course #, but also excludes the possibility of tripping up during where it might say "prerequisite" as in the first entry). After this, the regex finds the first line break and gets everything after it until it finds the next course #. The 3 fields would be a course number, a course title, and a course description. The course number and title are on the same line always and the description is everything beneath.
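A rough sketch of that guess as a single pattern, in case it helps frame the answers. The names `$header_re` and `parse_header` are made up for illustration, and the pattern assumes the course number is always exactly 4 capitals plus 3 digits.

```perl
use strict;
use warnings;

# Hypothetical pattern: 4 capitals + 3 digits, whitespace, then a title that
# starts with at least 2 capitals (so "Prerequisite: ARTA150" is not a header).
my $header_re = qr/^([A-Z]{4}\d{3})\s+([A-Z]{2}.*?)\s*$/;

# Returns (course number, title) for a header line, or an empty list otherwise.
sub parse_header {
    my ($line) = @_;
    return $line =~ $header_re ? ($1, $2) : ();
}

for my $line ('ARTA215   ADVANCED LIFE DRAWING (3 Cr) (2:2)  + Studio 1 hr.',
              'Prerequisite: ARTA150') {
    my ($id, $title) = parse_header($line);
    print defined $id ? "header: $id | $title\n" : "not a header: $line\n";
}
```

Everything between one header line and the next would then be the description.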

Sample end result would contain 3 fields which I'm guessing could be stored into 3 arrays:

"ARTA215","ADVANCED LIFE DRAWING (3 Cr) (2:2)  + Studio 1 hr.","This advanced study in drawing with the life .... Prerequisite: ARTA150 Lab Fee Required"


Like I said, it's quite a nightmare, but I want to automate this instead of cleaning up after someone each time the file is generated.

+11  A: 

Consider the following example that depends on blocks of course descriptions being completely contained within what Perl considers to be paragraphs:

#! /usr/bin/perl

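use strict;
use warnings;

$/ = "";  # paragraph mode: read blank-line-separated chunks
my $record_start = qr/
  ^            # at the start of a line (because of the m flag)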
  \s*          # allow optional leading whitespace
  ([A-Z]+\d+)  # capture course tag, e.g., ARTA215
  \s+          # separating whitespace
  (.+?)        # course title on rest of line
  \s*\n        # consume trailing whitespace
/mx;

while (<>) {
  my($course,$title);
  if (s/\A$record_start//) {
    ($course,$title) = ($1,$2);
  }
  elsif (s/(?s:^.+?)(?=$record_start)//) {
    redo;
  }
  else {
    next;
  }

  # $desc is declared once, on the line that fills it
  die "no description found for $course\n" unless s/^(.+?)(?=$record_start|\s*$)//s;
  (my $desc = $1) =~ s/\s*\n\s*/ /g;
  for ($course, $title, $desc) {
    s/^\s+//; s/\s+$//; s/\s+/ /g;
  }
  print join("," => map qq{"$_"} => $course, $title, $desc), "\n";
  redo if $_;
}

When fed your sample input, it outputs

"ARTA215","ADVANCED LIFE DRAWING (3 Cr) (2:2) + Studio 1 hr.","This advanced study in drawing with the life .... Prerequisite: ARTA150 Lab Fee Required"
"ARTA220","CERAMICS II (3 Cr) (2:2) + Studio 1 hr.","This course affords the student the opportunity to ex... Lab Fee Required"
"ARTA250","SPECIAL TOPICS IN ART","This course focuses on selected topic...."
"ARTA260","PORTFOLIO DEVELOPMENT (3 Cr) (3:0)","The purpose of this course is to pre...."
"BIOS010","INTRODUCTION TO BIOLOGICAL CONCEPTS (3IC) (2:2)","This course is a preparatory course designed to familiarize the begi...."
"BIOS101","GENERAL BIOLOGY (4 Cr) (3:3)","This course introduces the student to the principles of mo... Lab Fee Required"
"BIOS102","INTRODUCTION TO HUMAN BIOLOGY (4 Cr) (3:3)","This course is an introd.... Lab Fee Required"
Greg Bacon
Assuming the sample input is literally correct, you can't use paragraph mode...you've missed BIOS010.
ysth
@ysth Nice catch! Revised.
Greg Bacon
yo bacon.... this looks delicious. lets say i have my pl file sitting in a directory with all these courses, titles, and descriptions in a txt file. how can i modify your code to look for "new.txt"? I'm still googl'ing for code on how to feed my txt to your codebeast.....
CheeseConQueso
@CheeseConQueso Assuming you're in a directory that contains my.pl and new.txt, just run perl my.pl new.txt
Greg Bacon
@gbacon - i think i love you man... thanks, this rules
CheeseConQueso
@CheeseConQueso You're welcome!
Greg Bacon
I can't tell: should I feel heartwarmed by this bromantic PDA, or not?
glenn jackman
hahaha... its easy to love strangers when they help you out... and its sexist to think gbacon is a man.. maybe its some hot chick who has a picture of a random dude up there so no one bothers her ass!
CheeseConQueso
A: 

regex may be overkill for this, as the pattern appears to be simply:

[course]
[description]
{Prerequisites}
{Lab Fee Required}

where [course] is composed of

[course#] [course title] {# Cr} [etc/don't care]

and the course# is just the first 7 characters.

so you can scan the file with a simple state-machine, something like:

//NOTE: THIS IS PSEUDOCODE
s = 'parseCourse'
f = openFile(blah)
l = readLine(f)
while (l) {
    if (s=='parseCourse') {
        if (l.StartsWith('Prerequisite:')) {
            extractPrerequisite(l)
        }
        else if (l.StartsWith('Lab Fee Required')) {
            extractLabFeeRequired(l)
        }
        else {
            extractCourseInfo(l)
            s='parseDescription'
        }
    }
    else if (s=='parseDescription') {
        extractDescription(l)
        s='parseCourse'
    }
    l = readLine(f)
}
close(f)
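
Since the question is tagged Perl, here is one way the pseudocode above might be rendered in it. This is a sketch and not part of the original answer: the sub name `parse_courses` and the hash keys are invented, blank lines are skipped (the pseudocode does not mention them), and each description is assumed to fit on a single line.

```perl
use strict;
use warnings;

# Two-state scanner: 'parseCourse' expects a header (or a Prerequisite /
# Lab Fee line for the previous course), 'parseDescription' expects the
# one-line description that follows a header.
sub parse_courses {
    my ($fh) = @_;
    my @courses;
    my $state = 'parseCourse';
    while (my $line = <$fh>) {
        $line =~ s/^\s+|\s+$//g;          # trim both ends
        next if $line eq '';              # assumption: blank lines carry no information
        if ($state eq 'parseDescription') {
            $courses[-1]{description} = $line;
            $state = 'parseCourse';
        }
        elsif ($line =~ /^Prerequisite:\s*(.*)/) {
            $courses[-1]{prerequisite} = $1 if @courses;
        }
        elsif ($line eq 'Lab Fee Required') {
            $courses[-1]{lab_fee} = 1 if @courses;
        }
        else {
            # course number is the first 7 characters, the title is the rest
            my ($id, $title) = $line =~ /^(.{7})\s*(.*)/;
            push @courses, { id => $id, title => $title };
            $state = 'parseDescription';
        }
    }
    return @courses;
}

# Demo on an in-memory filehandle instead of a real file
my $sample = "ARTA215   ADVANCED LIFE DRAWING (3 Cr)\nThis advanced study ....\n"
           . "Prerequisite: ARTA150\nLab Fee Required\n\n"
           . "ARTA220   CERAMICS II  (3 Cr)\nThis course affords ....\n";
open my $fh, '<', \$sample or die "cannot open in-memory handle: $!";
print "$_->{id}: $_->{title}\n" for parse_courses($fh);
```

A description that wraps onto a second line would be misread as a new course header, so this only holds as long as the one-line assumption does.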
Steven A. Lowe
I think you missed seeing the perl tag on the question :)
ysth
what language is this?
Nathan Fellman
If only it had `$` it could be mistaken for Perl. Except, of course, s/openFile/open/ and s/readLine/readline/ etc ;-)
Sinan Ünür
@[ysth] @[Nathan Fellman] @[Sinan Unur] ya gotta be kidding. PSEUDOCODE! no one is paying me to write Perl, nor is any Perl-specific functionality required for this trivial problem
Steven A. Lowe
+7  A: 

Try:

my $course;
my @courses;
while ( my $line = <$input_handle> ) {
    if ( $line =~ /^([A-Z]{4}\d+)\s+([A-Z]{2}.*)/ ) {
        $course = [ "$1", "$2" ];
        push @courses, $course;
    }
    elsif ($course) {
        $course->[2] .= $line
    }
    else {
        # garbage before first course in file
        next
    }
}

This produces an array of arrays, as I understand you want. It would make more sense to me to have an array of hashes or even a hash of hashes.
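
For what it's worth, the array-of-hashes variant might look roughly like this. It is only a sketch: the sub name `read_courses`, the keys `id`, `title` and `description`, and the in-memory sample are assumptions for the example, and the regex is the one above with trailing whitespace trimmed from the title.

```perl
use strict;
use warnings;

# Same line-by-line scan as above, but each course is a hash reference.
sub read_courses {
    my ($input_handle) = @_;
    my @courses;
    while ( my $line = <$input_handle> ) {
        if ( $line =~ /^([A-Z]{4}\d+)\s+([A-Z]{2}.*?)\s*$/ ) {
            push @courses, { id => $1, title => $2, description => '' };
        }
        elsif (@courses) {
            $courses[-1]{description} .= $line;
        }
        # anything before the first course is ignored
    }
    return \@courses;
}

my $text = "ARTA215   ADVANCED LIFE DRAWING (3 Cr) (2:2)  + Studio 1 hr.\n"
         . "This advanced study in drawing with the life ....\n"
         . "Prerequisite: ARTA150\nLab Fee Required\n";
open my $in, '<', \$text or die "cannot open in-memory handle: $!";
for my $c ( @{ read_courses($in) } ) {
    ( my $desc = $c->{description} ) =~ s/\s+/ /g;   # flatten line breaks
    $desc =~ s/^ | $//g;
    print qq{"$c->{id}","$c->{title}","$desc"\n};
}
```

Named keys make the later CSV step a little easier to read than `$course->[2]`, at the cost of a bit more typing.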

ysth
(Before someone comments, the "" aren't useless. Can you figure out why?)
ysth
I'll bite: are the quotation marks necessary because $2 may contain spaces?
Telemachus
@Telemachus: no. The difference is more subtle than that, and would only be visible to the user in exceptional cases, not dependent on what characters are in $1 or $2.
ysth
Of course, we can always "fix" a perfectly understandable piece of code by adding something like this at the end to produce the desired output: print join "\n", map { join ',', map { s/(\r|\n)//gs; qq{"$_"} } @$_ } @courses;
Leonardo Herrera
@Leonardo Herrera: Thanks; I somehow completely missed the .csv part of the question. Consider your fine piece of code to be appended to mine.
ysth
@Ysth: I can't see it. What exceptional cases do the quotations protect you from?
Telemachus
@Telemachus: compare the results of perl -e'/()/; @x=("$1") x 1000000; system "ps v -p $$"' vs. without the quotes around $1.
ysth
+3  A: 

I had roughly the same idea as Gbacon to use paragraph mode since that will neatly chunk the file into records for you. He typed faster, but I wrote one, so here's my crack at it:

#!/usr/bin/env perl
use strict;
use warnings;

local $/ = "";

my @items;
while (<>) {
  my( $course, @rest ) = split /\n/, $_;
  my $description = join ' ', @rest;   # keep Prerequisite / Lab Fee lines too
  my( $course_id, $name ) = ($course =~ m/^(\w+)\s+(.*)$/);
  push @items, [ $course_id, $name, $description ];
}

for my $record (@items) {
  print "Course id: ", $record->[0], "\n";
  print "Name and credits: ", $record->[1], "\n";
  print "Description: ", $record->[2], "\n";
}

As Ysth points out in a comment on Gbacon's answer, paragraph mode may not work here. If not, never mind.
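
To see why, here is a small check of how paragraph mode chunks part of the sample. It is just a sketch: the helper `paragraphs` is a made-up name, and it reads from an in-memory string rather than the real file.

```perl
use strict;
use warnings;

# Split a string into "paragraphs" the way <> does when $/ = "".
sub paragraphs {
    my ($text) = @_;
    local $/ = "";                       # paragraph mode: blank lines end a record
    open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
    my @chunks = <$fh>;
    close $fh;
    return @chunks;
}

my $sample = <<'END';
ARTA250   SPECIAL TOPICS IN ART
  This course focuses on selected topic....

ARTA260   PORTFOLIO DEVELOPMENT   (3 Cr) (3:0)
The purpose of this course is to pre....
BIOS010   INTRODUCTION TO BIOLOGICAL CONCEPTS (3IC) (2:2)
This course is a preparatory course designed to familiarize the begi....
END

my @chunks = paragraphs($sample);
printf "%d paragraphs for 3 courses\n", scalar @chunks;   # prints "2 paragraphs for 3 courses"
```

ARTA260 and BIOS010 land in the same chunk because no blank line separates them, so a loop that treats each paragraph as one record silently drops BIOS010.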

Telemachus
All answers have permanent links, so you can link to them instead of just referring to them. Also, "below" makes little sense since answers are ordered by usefulness.
bzlm
Yup, I always forget that "above" and "below" are relative terms around here.
Telemachus
"above" and "below" are always relative terms, but they have a meta-relativity here. ;)
Otis
+1 though, man, I can't win around here today.
Telemachus
A: 
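#!/usr/bin/perl
$/ = "\n\n";   # input records are separated by one blank line
$FS = "\n";    # fields within a record are separated by newlines
$, = ',';      # output field separator
while (<>) {
    chomp;     # strips the trailing "\n\n" (the value of $/)
    @F = split($FS, $_);
    print join($,,@F) ."\n";
}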
ghostdog74
really? i gotta check this out on monday... im about to leave work... that top part above the while loop looks like the predator's wrist when he initiates the self-destruct sequence
CheeseConQueso