Hello,

I have text that lists course numbers, names, grades, and other information for courses taken by students. The lines look like this:

0301 453  20071 LINEAR SYSTEMS I                    A    4   4    16.0

0301 481  20071 ELECTRONICS I WITH LAB              A    4   4    16.0

0301 481  20084 ELECTRONICS II WITH LAB      RE     B    4   4    12.0

0301 713  20091 SOLID STATE PHYSICS          NG          0   0     0.0

0511 454  20074 INT'L TRADE & FINANCE               B    4   4    12.0

I want to write a regular expression that extracts:

LINEAR SYSTEMS I
ELECTRONICS I WITH LAB
ELECTRONICS II WITH LAB
SOLID STATE PHYSICS
INT'L TRADE & FINANCE

I wrote the following:

pattCourseName = re.compile(r'([-/&A-Z\':\s]{2,})(\s+[A-Z])')

However, this gives me:

LINEAR SYSTEMS I
ELECTRONICS I WITH LAB
ELECTRONICS II WITH LAB      RE
SOLID STATE PHYSICS
INT'L TRADE & FINANCE

That is, I cannot get rid of the RE part.

Can someone please help with this? Thanks!

+5  A: 

If the layout is fixed as you show, then forget the regular expression, and just grab the columns you want:

course_name = line[16:45].strip()
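
To make the slice idea concrete, here is a minimal sketch that applies it to the sample lines from the question. The function name `course_name` and the constants `NAME_START` and `NAME_END` are illustrative only; the offsets 16 and 45 come from the slice above and assume every line follows exactly the layout shown.

```python
# Offsets taken from the slice above. They assume the fixed layout in the
# question: the name field starts at column 16 and the next field at column 45.
NAME_START, NAME_END = 16, 45

def course_name(line):
    """Return the course-name field of one fixed-width line."""
    return line[NAME_START:NAME_END].strip()

# Sample lines copied from the question.
sample = [
    "0301 453  20071 LINEAR SYSTEMS I                    A    4   4    16.0",
    "0301 481  20071 ELECTRONICS I WITH LAB              A    4   4    16.0",
    "0301 481  20084 ELECTRONICS II WITH LAB      RE     B    4   4    12.0",
    "0301 713  20091 SOLID STATE PHYSICS          NG          0   0     0.0",
    "0511 454  20074 INT'L TRADE & FINANCE               B    4   4    12.0",
]

names = [course_name(line) for line in sample]
for name in names:
    print(name)
```

Naming the offsets keeps the one place that depends on the layout easy to find if the column positions ever change.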
Ned Batchelder
Beautiful solution! Thanks!
Curious2learn
+2  A: 
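for line in open("file"):
    # Split off the first three fields; drop the empty strings left by double spaces.
    s = [field for field in line.split(" ", 4) if field]
    # The name ends at the first run of two spaces (Python 3 print syntax).
    print(s[3].replace("  ", "|").split("|", 1)[0])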

output

$ python myscript.py
LINEAR SYSTEMS I
ELECTRONICS I WITH LAB
ELECTRONICS II WITH LAB
SOLID STATE PHYSICS
INT'L TRADE & FINANCE
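
The same idea, cutting at a gap of spaces rather than at a column position, can also be sketched with `re.split`. This is a variant, not the code above: it assumes a course name never contains two consecutive spaces and that at least two spaces separate the name from whatever follows. The helper name `course_name_by_gap` is hypothetical.

```python
import re

def course_name_by_gap(line):
    """Return the course name, assuming it ends at the first gap of 2+ spaces."""
    # Drop the three leading fields (two codes and the term), whatever the spacing.
    rest = line.split(None, 3)[3]
    # Cut at the first run of two or more whitespace characters.
    return re.split(r"\s{2,}", rest, maxsplit=1)[0]

line = "0301 481  20084 ELECTRONICS II WITH LAB      RE     B    4   4    12.0"
print(course_name_by_gap(line))
```

Because it does not depend on exact column positions, this version tolerates columns that drift, but it would cut a name short if the name itself contained a double space.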
ghostdog74
Beautiful! This will be great when the columns do not align, and I learnt new commands from your solution. Thanks!
Curious2learn