I use the following block of code to read lines out of a file 'f' into a nested list:

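t = []       # first column of each row
strmat = []  # remaining columns of each row
for data in f:
    # strip only the line ending so trailing empty fields are not lost
    clean_data = data.rstrip('\r\n')
    data = clean_data.split('\t')
    t.append(data[0])
    strmat.append(data[1:])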

Sometimes, however, the data is incomplete and a row may look like this:

['955.159', '62.8168', '', '', '', '', '', '', '', '', '', '', '', '', '', '29', '30', '0', '0']

It puts a spanner in the works because I would like Python to implicitly cast my list as floats but the empty fields '' cause it to be cast as an array of strings (dtype: s12).

I could start a second 'if' statement and convert all empty fields into NULL (since 0 is wrong in this instance) but I was unsure whether this was best.

  1. Is this the best strategy for dealing with incomplete data?
  2. Should I edit the stream or do it post-hoc?
A: 

How you should deal with incomplete values depends on the context of your application (which you haven't mentioned yet).

For example, you can simply ignore missing values:

>>> l = ['955.159', '62.8168', '', '', '', '', '', '', '', '', '', '', '', '', '', '29', '30', '0', '0']
>>> filter(bool, l) # remove empty values
['955.159', '62.8168', '29', '30', '0', '0']
>>> map(float, filter(bool, l)) # remove empty values and convert the rest to floats
[955.15899999999999, 62.816800000000001, 29.0, 30.0, 0.0, 0.0]
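On Python 3, `filter` and `map` return lazy iterators rather than lists, so the same idea needs an explicit `list(...)` or a list comprehension. A minimal sketch, assuming Python 3 (the shortened `row` and the name `values` are only illustrative). Note that dropping fields shifts the remaining values left, so the original column positions are lost:

```python
row = ['955.159', '62.8168', '', '', '29', '30', '0', '0']

# Keep only the non-empty fields and convert them to floats.
# The string '0' is non-empty, so it is kept and becomes 0.0.
values = [float(field) for field in row if field]
print(values)  # [955.159, 62.8168, 29.0, 30.0, 0.0, 0.0]
```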

Alternatively, you might want to replace them with NULL, as you mentioned:

>>> map(lambda x: x or 'NULL', l)
['955.159', '62.8168', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', '29', '30', '0', '0']
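If the column positions matter, the placeholder and the float conversion can be combined in one pass. A rough sketch, assuming Python 3 and `None` as the missing-value marker; `parse_field` is a hypothetical helper name, not something from the question:

```python
def parse_field(field):
    """Return the field as a float, or None when the field is empty."""
    field = field.strip()
    return float(field) if field else None

row = ['955.159', '62.8168', '', '', '29', '30', '0', '0']

# Every field keeps its position; empty ones become None instead of 0.
parsed = [parse_field(field) for field in row]
print(parsed)  # [955.159, 62.8168, None, None, 29.0, 30.0, 0.0, 0.0]
```

Because the result has the same length as the input row, a missing measurement stays distinguishable from a recorded value of 0.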

As you can see, there are several strategies for dealing with incomplete data, and the snippets above may help you choose the right one for your task. I prefer functional-style built-ins for this kind of thing because they are often the shortest and easiest way to do it, and I don't expect any noticeable difference in execution time.

tux21b
Thank you for the reply, tux21b. You've introduced me to two built-ins: filter and map. I always prefer readability, so the latter solution you've provided is better. I've tried it and it works. The context: the data represents analyzed ECG (ElectroCardioGram) parameters. The rows contain intervals (in msecs) and voltages (in mV) between two peaks. The data is scientific, so I need to convert an empty string to NULL, as in 'not recorded', rather than 0 msecs or 0 mV. It is part of a larger strategy to formalize the data I have into a large dataset that I can then do further work on.
EmlynC
You are welcome. The "NULL" value in Python is called `None`, by the way, so it might make sense to use `None` instead of `"NULL"` in your source. Another useful built-in for functional programming is `reduce()`, and the itertools module has several more such functions.
tux21b