I have a file of names and addresses as follows (example line)

OSCAR    ,CANNONS      ,8     ,STIEGLITZ CIRCUIT

And I want to read it into a dictionary mapping field names to values. Here self.field_list is a list of (start point, length, name) tuples describing the fixed fields in the file. What ways are there to speed up this method? (Python 2.6)

def line_to_dictionary(self, file_line, rec_num):
  file_line = file_line.lower()  # Make it all lowercase

  return_rec = {}  # Return record as a dictionary

  for (field_start, field_length, field_name) in self.field_list:

    field_data = file_line[field_start:field_start + field_length]

    if self.strip_fields:  # Strip off whitespace first
      field_data = field_data.strip()

    if field_data != '':  # Only add non-empty fields to the dictionary
      return_rec[field_name] = field_data

  # Set hidden fields
  return_rec['_rec_num_'] = rec_num
  return_rec['_dataset_name_'] = self.name
  return return_rec
+2  A: 

struct.unpack(), combined with width-prefixed s specifiers (e.g. 9s), will tear the string apart faster than slicing.
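A minimal sketch, assuming the field widths of the example line (9, 13, 6 and 17 characters); in practice the format string would be built from self.field_list:

import struct

# 'x' is a pad byte that skips the comma between fields.
unpacker = struct.Struct('9sx13sx6sx17s')

line = 'OSCAR    ,CANNONS      ,8     ,STIEGLITZ CIRCUIT'
first_name, surname, number, street = unpacker.unpack(line)
# Fields keep their padding, so they still need .strip()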

Ignacio Vazquez-Abrams
Tried this out but not sure how to deal with overlapping fields
Martlark
... Overlapping fields? Who came up with that one?
Ignacio Vazquez-Abrams
A: 

If you want to get some speedup, you can also store field_start + field_length directly in self.field_list, instead of storing field_length.
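A minimal sketch, assuming self.field_list can be rebuilt once up front (e.g. in __init__):

# Precompute the slice end points once...
self.field_list = [(start, start + length, name)
                   for (start, length, name) in self.field_list]

# ...so the loop in line_to_dictionary() saves one addition per field:
for (field_start, field_end, field_name) in self.field_list:
  field_data = file_line[field_start:field_end]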

I would say that your method is quite fast, compared to what standard Python can do (i.e., without using non-standard, dedicated modules).

EOL
A: 

If your lines include commas like the example, you can use line.split(',') instead of several slices. This may prove to be faster.
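A minimal sketch (the field names here are hypothetical):

fields = [f.strip() for f in file_line.split(',')]
return_rec = dict(zip(('first_name', 'surname', 'number', 'street'), fields))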

lunixbochs
As long as none of the records ever have a comma...
eswald
A: 

You'll want to use the csv module.

It handles not only CSV, but any CSV-like format, which yours seems to be.
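A minimal sketch, assuming a hypothetical file name and that the commas really are delimiters:

import csv

# 'rb' mode, as the Python 2 csv documentation recommends.
for row in csv.reader(open('file_name.txt', 'rb')):
  print [field.strip() for field in row]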

e-satis
Unfortunately "CSV-like" may not be enough. It may be possible for fields to contain embedded commas, at which point both `csv` and `line.split(',')` will fail horribly.
Ignacio Vazquez-Abrams
+1  A: 

Edit: Just saw your remark below about commas. The approach below is fast when it comes to file reading, but it is delimiter-based, and would fail in your case. It's useful in other cases, though.

If you want to read the file really fast, you can use a dedicated module, such as the almost standard Numpy:

data = numpy.loadtxt('file_name.txt', dtype='S10,S8', delimiter=',')   # dtype must be adapted to your column sizes

loadtxt() also allows you to process fields on the fly (with the converters argument). Numpy also allows you to give names to columns (see the doc), so that you can do:

data['name'][42]  # Name # 42

The structure obtained is like an Excel array; it is quite memory efficient, compared to a dictionary.

If you really need a dictionary, you can loop over the data array that Numpy read quickly, in a way similar to what you have done.
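A minimal sketch, assuming named columns in the dtype (the names and sizes here are hypothetical):

import numpy

data = numpy.loadtxt('file_name.txt', delimiter=',',
                     dtype=[('name', 'S10'), ('street', 'S20')])

records = []
for rec_num, row in enumerate(data):
  # Build the same kind of dictionary as in the question.
  rec = dict((col, row[col].strip()) for col in data.dtype.names
             if row[col].strip())
  rec['_rec_num_'] = rec_num
  records.append(rec)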

EOL