ansaurus

Question

Parse a CSV file using python (to make a decision tree later)

Answer 1

+1 A:

Take a look at the built-in CSV module. Though you probably can't just use it, you can take a sneak peek at the code...

If that's a no-no, your (pseudo)code looks perfectly fine, though you should read the file line by line and use str.split() to break each line into fields.

kaloyan 2010-04-28 01:00:02

Answer 2

+3 A:

Python has some pretty powerful language constructs built in. You can read a file line by line like this:

with open(name_of_file, "r") as f:
    for line in f:
        # process the line

You can use str.split to break the line apart at the commas, and str.strip to remove leading and trailing whitespace from each piece. Python also has very powerful lists and dictionaries.
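As a minimal sketch of those two steps (written for Python 3; the helper name parse_line and the sample data are made up for illustration), splitting and stripping might look like this. It assumes no field contains a quoted comma:

```python
import io

def parse_line(line):
    """Split one comma-separated line and trim whitespace around each field."""
    fields = []
    for piece in line.split(","):
        fields.append(piece.strip())
    return fields

# io.StringIO stands in for a real file opened with open(name_of_file, "r").
sample = io.StringIO("outlook, temp, play\nsunny, hot, no\nrainy, mild, yes\n")

rows = []
for line in sample:
    rows.append(parse_line(line))

print(rows)
```

Note that strip() also removes the newline at the end of each line, so the last field comes out clean.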

To create an empty list, you simply use empty brackets, [], while to create an empty dictionary you use {}:

mylist = []  # Creates an empty list
mydict = {}  # Creates an empty dictionary

You can insert into the list using the .append() function, while you can use indexing subscripts to insert into the dictionary. For example, you can use mylist.append(5) to add 5 to the list, while you can use mydict[key]=value to associate the key key with the value value. To test whether a key is present in the dictionary, you can use the in keyword. For example:

if key in mydict:
    print "Present"
else:
    print "Absent"
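For example, the in test and subscript assignment combine naturally when counting how often each value occurs, which tends to come up when building a decision tree. This is only a sketch (Python 3 syntax; count_values and the sample labels are hypothetical names, not part of the question):

```python
def count_values(values):
    """Return a dictionary mapping each distinct value to how often it occurs."""
    counts = {}
    for value in values:
        if value in counts:
            counts[value] += 1  # key already present: bump its count
        else:
            counts[value] = 1   # first time this key is seen
    return counts

labels = ["yes", "no", "yes", "yes"]
print(count_values(labels))
```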

To iterate over the contents of a list or dictionary, you can simply use a for-loop as in:

for val in mylist:
    # do something with val

for key in mydict:
    # do something with key or with mydict[key]

Since it is often necessary to have both the value and its index when iterating over a list, there is also a built-in function called enumerate that saves you the trouble of counting indices yourself:

for idx, val in enumerate(mylist):
    # do something with val or with idx. Note that val == mylist[idx]

The code above is identical in function to:

idx = 0
for val in mylist:
    # process val, idx
    idx += 1
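One place enumerate could help with a CSV file is turning the header row into a lookup from column name to position. A small sketch under that assumption (Python 3 syntax; the header and row values are invented):

```python
header = ["outlook", "temp", "play"]

# Map each column name to its position in the row.
column_index = {}
for idx, name in enumerate(header):
    column_index[name] = idx

row = ["sunny", "hot", "no"]
print(row[column_index["play"]])
```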

You could also iterate over the indices if you so chose:

for idx in xrange(len(mylist)):
    # Do something with idx and possibly mylist[idx]

Also, you can get the number of elements in a list or the number of keys in a dictionary using len.

It is also possible to apply an operation to each element of a list or dictionary with a list comprehension; however, I would recommend that you simply use for-loops to accomplish that task. But, as an example:

>>> list1 = range(10)
>>> list1
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list2 = [2*x for x in list1]
>>> list2
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
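Applied to parsed rows, a comprehension could pull out a single column, for instance the class label in the last field. The sketch below (Python 3 syntax, made-up sample rows) shows the comprehension next to the equivalent for-loop so either form can be used:

```python
rows = [
    ["sunny", "hot", "no"],
    ["rainy", "mild", "yes"],
    ["overcast", "hot", "yes"],
]

# Comprehension form: take the last field of every row.
labels = [row[-1] for row in rows]

# Equivalent for-loop form.
labels_loop = []
for row in rows:
    labels_loop.append(row[-1])

print(labels == labels_loop)
```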

When you have the time, I suggest you read the Python tutorial to get some more in-depth knowledge.

Michael Aaron Safyan 2010-04-28 01:02:05

Answer 3

A:

Robert 2010-04-28 02:22:50

You forgot to mention what happens when there are quotes in the original data, resulting in csv input like `Colt,45,"owned by John ""Quick Draw"" McGraw"` ... that finite state machine gets rather complicated.

John Machin 2010-04-28 02:46:09

Thanks John, I updated the answer.

Robert 2010-04-28 03:12:34

So now your finite state machine appears to need a 1-byte lookahead buffer ("and the next character is a quote") which is not a good look -- the decision making process should require ONLY a current state and ONE input character. BTW why are you egging the OP on to write a FSM when it appears the purpose of her exercise is scripting a decision tree, not low-level byte-bashing?

John Machin 2010-04-28 03:38:24

The CSV module's method does `parse_add_char(self, c)` in http://svn.python.org/projects/python/trunk/Modules/_csv.c when in QUOTE_IN_QUOTED_FIELD state (same as my method). Just showing all the problems with implementing a CSV parser. OP says: "At the moment, I'm trying to work out the first part: parsing the CSV". If the OP is allowed to use the CSV module, then great. Although if she uses the CSV module the header's first value won't come out right because it's "commented out" with a #. So that indicates to me that part of the project is writing a parser.

Robert 2010-04-28 04:33:22

Sorry, actually, their method is look behind. If in quote, and current is quote, etc... You're right John. Updated the answer.

Robert 2010-04-28 04:43:10

`123,"",456` is quite valid ... you can get that quite easily when the writer has been told to quote all strings and is fed a zero-length string. In any case the reader MUST read that as an empty field. Second point: how difficult is it to test `if field[0].startswith("#"):`? Not very. DOESN'T indicate requirement to write parser.

John Machin 2010-04-28 05:18:27

I agree John, writing a parser is definitely the hard way. The OP should use the built in CSV module. If she can't because it's a uni project where you need to show knowledge of writing a parser, this answer shows many caveats she'll have to deal with. Thanks for helping flush them out.

Robert 2010-04-28 07:11:31
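
For anyone who does have to write the parser by hand, here is a rough sketch of a state machine that decides on the current state and one input character only, as asked for in the comments above. The state names and the function split_csv_line are made up for this illustration; it is not the code from the answer or from the csv module, it is written for Python 3, and it handles neither newlines inside quoted fields nor whitespace trimming:

```python
START, UNQUOTED, QUOTED, QUOTE_IN_QUOTED = range(4)

def split_csv_line(line):
    """Split one CSV line into fields, honouring double-quoted fields."""
    fields = []
    current = []          # characters of the field being built
    state = START
    for ch in line.rstrip("\r\n"):
        if state == START:                 # at the beginning of a field
            if ch == '"':
                state = QUOTED
            elif ch == ",":
                fields.append("")          # empty unquoted field
            else:
                current.append(ch)
                state = UNQUOTED
        elif state == UNQUOTED:
            if ch == ",":
                fields.append("".join(current))
                current = []
                state = START
            else:
                current.append(ch)
        elif state == QUOTED:
            if ch == '"':
                state = QUOTE_IN_QUOTED    # closing quote, or first of a "" pair
            else:
                current.append(ch)
        else:                              # QUOTE_IN_QUOTED
            if ch == '"':
                current.append('"')        # "" inside quotes is a literal quote
                state = QUOTED
            elif ch == ",":
                fields.append("".join(current))
                current = []
                state = START
            else:
                current.append(ch)
                state = UNQUOTED
    fields.append("".join(current))        # last field has no trailing comma
    return fields

print(split_csv_line('Colt,45,"owned by John ""Quick Draw"" McGraw"'))
print(split_csv_line('123,"",456'))
```

The doubled quote is resolved by remembering, as a state, that a quote was just seen inside a quoted field, so no lookahead is needed. The built-in csv module remains the safer choice whenever it is allowed.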

Answer 4

A:

I don't know too much about the builtin csv module that @Kaloyan Todorov talks about, but, if you're reading comma separated lines, then you can easily do this:

for line in file:
    columns = line.split(',')
    for column in columns:
        print column.strip()

This will print all the entries of each line without leading and trailing whitespace.
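
One caveat, using the quoted example from the comments on Answer 3: a plain split(',') leaves the quote characters in place and breaks any field that itself contains a comma. A hedged comparison with the csv module (Python 3 syntax; the second sample line is invented for illustration):

```python
import csv

line = 'Colt,45,"owned by John ""Quick Draw"" McGraw"'
line_with_comma = 'a,"b,c",d'  # hypothetical line with a comma inside quotes

# Naive splitting keeps the surrounding quotes and the doubled "" pairs.
naive = line.split(",")

# csv.reader accepts any iterable of lines, so a one-element list works.
parsed = next(csv.reader([line]))

print(naive)
print(parsed)
print(line_with_comma.split(","))
print(next(csv.reader([line_with_comma])))
```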

inspectorG4dget 2010-04-28 02:28:46

"""I don't know too much about the builtin csv module""" ... about time you remedied that deficiency ;-)

John Machin 2010-04-28 02:33:16

Totally agree. Spent a good hour or so last night reading the docs. Time well spent.

inspectorG4dget 2010-04-28 18:47:44

Answer 5

+2 A:

Short answer: don't waste time and mental energy (1) reimplementing the built-in csv module (2) reading the csv module's source (it's written in C) -- just USE it!

John Machin 2010-04-28 02:36:23

Example code would help.

blokeley 2010-04-28 10:35:31

@blokeley: reading what the OP wrote would help: """This is going towards a uni assignment, so I don't want to receive code"""

John Machin 2010-04-28 10:40:31

Answer 6

+3 A:

Example using the csv module from docs.python.org:

import csv
reader = csv.reader(open("some.csv", "rb"))
for row in reader:
    print row

Instead of printing the rows, you could append each row to a list (created beforehand, e.g. database = []) and then process it with ID3 later.

database.append(row)
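
Put together, and also handling the header whose first value starts with # (as mentioned in the comments on Answer 3), a self-contained sketch might look like the following. It is written for Python 3, where the file would be opened in text mode with newline="" rather than "rb"; io.StringIO and the sample data stand in for a real file and are assumptions for illustration:

```python
import csv
import io

# Hypothetical sample data; io.StringIO stands in for an opened file.
text = "#outlook,temp,play\nsunny,hot,no\nrainy,mild,yes\n"

reader = csv.reader(io.StringIO(text))

# The first row is the header; drop the leading '#' if there is one.
header = next(reader)
if header[0].startswith("#"):
    header[0] = header[0][1:]

# Collect the remaining rows for later processing.
database = []
for row in reader:
    database.append(row)

print(header)
print(database)
```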

Robert 2010-04-28 03:24:03

Answer 7

+2 A:

Look at csv.DictReader.

Example:

import csv
reader = csv.DictReader(open('my_file.csv', 'rb'))  # 'rb' = read binary
for d in reader:
    print d # this will print out a dictionary with keys equal to the first row of the file.
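
A self-contained variant of the same idea, as a sketch for Python 3 (io.StringIO and the sample data are stand-ins for a real file, not part of the original example):

```python
import csv
import io

# Hypothetical sample data; the first line supplies the dictionary keys.
text = "outlook,temp,play\nsunny,hot,no\nrainy,mild,yes\n"

reader = csv.DictReader(io.StringIO(text))

# dict(d) gives a plain dictionary regardless of the Python 3 version in use.
records = [dict(d) for d in reader]

print(reader.fieldnames)
print(records[0]["play"])
```

Each record can then be indexed by column name instead of by position.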

wisty 2010-04-28 10:08:35

two typos in the code sample: missing . in `csvDictReader` and no closing `)`

matt wilkie 2010-09-16 21:51:41
