ansaurus

Question

Is it possible to parse this nightmare using Perl?

Answer 1

+11 A:

Consider the following example that depends on blocks of course descriptions being completely contained within what Perl considers to be paragraphs:

#! /usr/bin/perl

use strict;
use warnings;

$/ = "";  # paragraph mode: read blank-line-separated chunks
my $record_start = qr/
  ^            # at the start of a line
  \s*          # allow optional leading whitespace
  ([A-Z]+\d+)  # capture course tag, e.g., ARTA215
  \s+          # separating whitespace
  (.+?)        # course title on rest of line
  \s*\n        # consume trailing whitespace
/mx;

while (<>) {
  my($course,$title);
  if (s/\A$record_start//) {
    ($course,$title) = ($1,$2);
  }
  elsif (s/(?s:^.+?)(?=$record_start)//) {
    redo;
  }
  else {
    next;
  }

  die unless s/^(.+?)(?=$record_start|\s*$)//s;
  (my $desc = $1) =~ s/\s*\n\s*/ /g;
  for ($course, $title, $desc) {
    s/^\s+//; s/\s+$//; s/\s+/ /g;
  }
  print join("," => map qq{"$_"} => $course, $title, $desc), "\n";
  redo if $_;
}

When fed your sample input, it outputs

"ARTA215","ADVANCED LIFE DRAWING (3 Cr) (2:2) + Studio 1 hr.","This advanced study in drawing with the life .... Prerequisite: ARTA150 Lab Fee Required"
"ARTA220","CERAMICS II (3 Cr) (2:2) + Studio 1 hr.","This course affords the student the opportunity to ex... Lab Fee Required"
"ARTA250","SPECIAL TOPICS IN ART","This course focuses on selected topic...."
"ARTA260","PORTFOLIO DEVELOPMENT (3 Cr) (3:0)","The purpose of this course is to pre...."
"BIOS010","INTRODUCTION TO BIOLOGICAL CONCEPTS (3IC) (2:2)","This course is a preparatory course designed to familiarize the begi...."
"BIOS101","GENERAL BIOLOGY (4 Cr) (3:3)","This course introduces the student to the principles of mo... Lab Fee Required"
"BIOS102","INTRODUCTION TO HUMAN BIOLOGY (4 Cr) (3:3)","This course is an introd.... Lab Fee Required"

Greg Bacon 2009-06-25 16:38:41

Assuming the sample input is literally correct, you can't use paragraph mode...you've missed BIOS010.

ysth 2009-06-25 16:48:33

@ysth Nice catch! Revised.

Greg Bacon 2009-06-25 17:05:14

yo bacon.... this looks delicious. lets say i have my pl file sitting in a directory with all these courses, titles, and descriptions in a txt file. how can i modify your code to look for "new.txt"? I'm still googl'ing for code on how to feed my txt to your codebeast.....

CheeseConQueso 2009-06-25 17:49:52

@CheeseConQueso Assuming you're in a directory that contains my.pl and new.txt, just run perl my.pl new.txt

Greg Bacon 2009-06-25 17:52:07

@gbacon - i think i love you man... thanks, this rules

CheeseConQueso 2009-06-25 17:54:50

@CheeseConQueso You're welcome!

Greg Bacon 2009-06-25 17:55:47

I can't tell: should I feel heartwarmed by this bromantic PDA, or not?

glenn jackman 2009-06-25 18:44:43

hahaha... its easy to love strangers when they help you out... and its sexist to think gbacon is a man.. maybe its some hot chick who has a picture of a random dude up there so no one bothers her ass!

CheeseConQueso 2009-06-25 18:50:10

Answer 2

A:

regex may be overkill for this, as the pattern appears to be simply:

[course]
[description]
{Prerequisites}
{Lab Fee Required}

where [course] is composed of

[course#] [course title] {# Cr} [etc/don't care]

and the course# is just the first 7 characters.

so you can scan the file with a simple state-machine, something like:

//NOTE: THIS IS PSEUDOCODE
s = 'parseCourse'
f = openFile(blah)
l = readLine(f)
while (l) {
    if (s=='parseCourse') {
        if (l.StartsWith('Prerequisite:')) {
            extractPrerequisite(l)
        }
        else if (l.StartsWith('Lab Fee Required')) {
            extractLabFeeRequired(l)
        }
        else {
            extractCourseInfo(l)
            s='parseDescription'
        }
    }
    else if (s=='parseDescription') {
        extractDescription(l)
        s='parseCourse'
    }
    l = readLine(f)
}
close(f)
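
For readers who want something runnable, here is one possible Perl rendering of the pseudocode above. It is only a sketch: the name parse_courses, the hash keys, and the sample data are made up for illustration, blank lines are skipped as an assumption, and (like the pseudocode) it assumes each description fits on a single line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: reads lines from a filehandle and returns a list of
# hash refs, one per course. Assumes one-line descriptions, as the
# pseudocode does.
sub parse_courses {
    my ($fh) = @_;
    my @courses;
    my $state = 'parseCourse';
    while (my $line = <$fh>) {
        chomp $line;
        next unless $line =~ /\S/;    # skip blank lines (an assumption)
        if ($state eq 'parseCourse') {
            if ($line =~ /^Prerequisite:\s*(.*)/) {
                $courses[-1]{prerequisite} = $1 if @courses;
            }
            elsif ($line =~ /^Lab Fee Required/) {
                $courses[-1]{lab_fee} = 1 if @courses;
            }
            else {
                # the course number is taken to be the first 7 characters
                my ($id, $title) = $line =~ /^(.{7})\s*(.*)/ or next;
                push @courses, { id => $id, title => $title };
                $state = 'parseDescription';
            }
        }
        else {    # parseDescription
            $courses[-1]{description} = $line;
            $state = 'parseCourse';
        }
    }
    return @courses;
}

# Usage: feed it a small in-memory sample (made-up data).
my $sample = <<'END';
ARTA215 ADVANCED LIFE DRAWING (3 Cr)
An advanced drawing course.
Prerequisite: ARTA150
Lab Fee Required
BIOS101 GENERAL BIOLOGY (4 Cr)
An introductory biology course.
END
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my @courses = parse_courses($fh);
print "$_->{id}: $_->{title}\n" for @courses;
```

If the real descriptions span several lines, the parseDescription branch would need to keep appending until the next course line shows up.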

Steven A. Lowe 2009-06-25 16:38:53

I think you missed seeing the perl tag on the question :)

ysth 2009-06-25 16:43:07

what language is this?

Nathan Fellman 2009-06-25 17:55:27

If only it had `$` it could be mistaken for Perl. Except, of course, s/openFile/open/ and s/readLine/readline/ etc ;-)

Sinan Ünür 2009-06-25 17:57:58

@ysth @Nathan Fellman @Sinan Ünür ya gotta be kidding. PSEUDOCODE! no one is paying me to write Perl, nor is any Perl-specific functionality required for this trivial problem

Steven A. Lowe 2009-06-26 05:21:46

Answer 3

+7 A:

Try:

my $course;
my @courses;
while ( my $line = <$input_handle> ) {
    if ( $line =~ /^([A-Z]{4}\d+)\s+([A-Z]{2}.*)/ ) {
        $course = [ "$1", "$2" ];
        push @courses, $course;
    }
    elsif ($course) {
        $course->[2] .= $line
    }
    else {
        # garbage before first course in file
        next
    }
}

This produces an array of arrays, as I understand you want. It would make more sense to me to have an array of hashes or even a hash of hashes.
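
A rough sketch of the array-of-hashes variant, wrapped in a sub so it can be tried on an in-memory string. The name read_courses, the field names (id, title, description), and the sample text are illustrative assumptions, not anything from the question.

```perl
use strict;
use warnings;

# Same line-by-line approach, but each course is a hash ref with named
# fields. The field names are illustrative choices.
sub read_courses {
    my ($input_handle) = @_;
    my ($course, @courses);
    while ( my $line = <$input_handle> ) {
        if ( $line =~ /^([A-Z]{4}\d+)\s+([A-Z]{2}.*)/ ) {
            $course = { id => "$1", title => "$2", description => '' };
            push @courses, $course;
        }
        elsif ($course) {
            $course->{description} .= $line;
        }
        # anything before the first course is ignored
    }
    return \@courses;
}

# Usage with made-up data; the first line is garbage before any course.
my $sample = "junk\nARTA215 ADVANCED LIFE DRAWING (3 Cr)\nFirst line.\nSecond line.\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my $courses = read_courses($fh);
printf "%s has title '%s'\n", $courses->[0]{id}, $courses->[0]{title};
```

Named fields make later code read as $course->{title} rather than $course->[1], which is the main argument for hashes here.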

ysth 2009-06-25 16:42:15

(Before someone comments, the "" aren't useless. Can you figure out why?)

ysth 2009-06-25 16:46:13

I'll bite: are the quotation marks necessary because $2 may contain spaces?

Telemachus 2009-06-25 17:53:28

@Telemachus: no. The difference is more subtle than that, and would only be visible to the user in exceptional cases, not dependent on what characters are in $1 or $2.

ysth 2009-06-25 18:05:30

Of course, we can always "fix" a perfectly understandable piece of code by adding something like this at the end to produce the desired output: print join "\n", map { join ',', map { s/(\r|\n)//gs; qq{"$_"} } @$_ } @courses;

Leonardo Herrera 2009-06-25 19:07:04

@Leonardo Herrera: Thanks; I somehow completely missed the .csv part of the question. Consider your fine piece of code to be appended to mine.

ysth 2009-06-25 19:21:45

@Ysth: I can't see it. What exceptional cases do the quotations protect you from?

Telemachus 2009-06-25 23:43:31

@Telemachus: compare the results of perl -e'/()/; @x=("$1") x 1000000; system "ps v -p $$"' vs. without the quotes around $1.

ysth 2009-07-04 03:21:43

Answer 4

+3 A:

I had roughly the same idea as Gbacon to use paragraph mode since that will neatly chunk the file into records for you. He typed faster, but I wrote one, so here's my crack at it:

#!/usr/bin/env perl
use strict;
use warnings;

local $/ = "";

my @items;
while (<>) {
  my( $course, $description ) = (split /\n/, $_)[0, 1];
  my( $course_id, $name ) = ($course =~ m/^(\w+)\s+(.*)$/);
  push @items, [ $course_id, $name, $description ];
}

for my $record (@items) {
  print "Course id: ", $record->[0], "\n";
  print "Name and credits: ", $record->[1], "\n";
  print "Description: ", $record->[2], "\n";
}

As Ysth points out in a comment on Gbacon's answer, paragraph mode may not work here. If not, never mind.
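
For anyone unsure what paragraph mode does: setting $/ to the empty string makes each read return one blank-line-separated chunk, and runs of blank lines count as a single separator. Setting $/ to "\n\n" is subtly different because the separator is then matched literally. A small sketch with made-up data (count_records is a hypothetical helper):

```perl
use strict;
use warnings;

# Hypothetical helper: count the records read from $text with the given
# input record separator.
sub count_records {
    my ($text, $separator) = @_;
    local $/ = $separator;
    open my $fh, '<', \$text or die "cannot open in-memory file: $!";
    my @records = <$fh>;
    return scalar @records;
}

# Two blocks separated by three blank lines (made-up data).
my $text = "a\nb\n\n\n\nc\nd\n";
print 'paragraph mode: ', count_records($text, ''),     "\n";
print 'literal "\n\n": ', count_records($text, "\n\n"), "\n";
```

Paragraph mode should report 2 records here, while the literal separator reports 3, because the extra blank lines become a record of their own.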

Telemachus 2009-06-25 16:57:50

All answers have permanent links, so you can link to them instead of just referring to them. Also, "below" makes little sense since answers are ordered by usefulness.

bzlm 2009-06-25 17:03:02

Yup, I always forget that "above" and "below" are relative terms around here.

Telemachus 2009-06-25 17:04:47

"above" and "below" are always relative terms, but they have a meta-relativity here. ;)

Otis 2009-06-25 17:32:08

+1 though, man, I can't win around here today.

Telemachus 2009-06-25 17:33:52

Answer 5

A:

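#!/usr/bin/perl
$/ = "\n\n";   # read one blank-line-separated record at a time
$FS = "\n";    # fields within a record are separated by newlines
$, = ',';      # output field separator, reused below as the join string
while (<>) {
    chomp;                      # strip the trailing "\n\n" record separator
    @F = split($FS, $_);        # one field per line of the record
    print join($,,@F) ."\n";    # emit the record as one comma-separated line
}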

ghostdog74 2010-01-29 01:26:26

really? i gotta check this out on monday... im about to leave work... that top part above the while loop looks like the predator's wrist when he initiates the self-destruct sequence

CheeseConQueso 2010-02-05 21:43:25
