I'm working with an output list that contains the following information:

[start position, stop position, chromosome, 
    [('sample name', 'sample value'), 
     ('sample name','sample value')...]]

[[59000, 59500, chr1, 
    [('cn_04', '1.362352462'), ('cn_01', '1.802001235')]], 
    [100000, 110000, chr1, 
        [('cn_03', '1.887268908'), ('cn_02', '1.990457407'), ('cn_01', '4.302275763')]],
    [63500, 64000, chr1, 
        [('cn_03', '1.887268908'), ('cn_02', '1.990457407'), ('cn_01', '4.302275763')]]
    ...]

I want to write it to an Excel file, with the sample names as the column titles and each sample's values in the corresponding column. Some samples don't have values, so those cells would be blank or carry a no-data marker. Something that looks like this:

cn_01     cn_02     cn_03     cn_04     cn_05     cn_06    start    stop    chromosome  

1.802     ""        ""        1.362     ""        ""       59000    59500   chr1  
4.302     1.990     1.887     ""        ""        ""       100000   110000  chr1

Any help would be great. Thanks!

A: 

You can create a simple text file with a ".csv" extension. Separate each field (column) with a comma. Optionally, put quotation marks around text fields, especially if a field may contain your delimiter (the comma). You can even include Excel formulas (preceded by '=') and Excel will parse them correctly.

Double-clicking a CSV file will open it in Excel (unless your computer is configured otherwise).

You can also use the csv module
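
As a rough sketch of what the csv module route could look like for the data in the question: csv.DictWriter fills in a blank cell for any sample a row doesn't have (via restval). The column order and the rows_to_csv helper name are my own assumptions, not something from the question, and this targets Python 3.

```python
import csv
import io

# Column order is an assumption: the six sample names from the question,
# followed by the three positional fields.
SAMPLES = ['cn_01', 'cn_02', 'cn_03', 'cn_04', 'cn_05', 'cn_06']
FIELDS = SAMPLES + ['start', 'stop', 'chromosome']

data = [
    [59000, 59500, 'chr1',
     [('cn_04', '1.362352462'), ('cn_01', '1.802001235')]],
    [100000, 110000, 'chr1',
     [('cn_03', '1.887268908'), ('cn_02', '1.990457407'), ('cn_01', '4.302275763')]],
]

def rows_to_csv(rows, out):
    """Write rows to the file-like object `out` as CSV with a header line."""
    # restval='' leaves a blank cell for any sample missing from a row.
    writer = csv.DictWriter(out, fieldnames=FIELDS, restval='')
    writer.writeheader()
    for start, stop, chrom, samples in rows:
        record = dict(samples)  # list of (name, value) tuples -> dict
        record.update(start=start, stop=stop, chromosome=chrom)
        writer.writerow(record)

# Written to an in-memory buffer here; pass an open file object to write to disk.
buf = io.StringIO()
rows_to_csv(data, buf)
csv_text = buf.getvalue()
print(csv_text)
```

To get a real file, open it with open('samples.csv', 'w', newline='') and pass that in place of the buffer (the file name is just a placeholder).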

The Learning Python book contains examples with more complex control (formatting, spreadsheets) using Windows COM components

EDIT: I have just seen this site. The PDF tutorial seems to be very detailed. Never used this.

bgbg
A: 

Here's one approach. I made the simplifying assumption that there is a small finite limit to the possible number of observations, so I just loop from 1 to 6 explicitly. You can easily expand the upper limit of the loop, although if you go past 9 the logic in the get_obs function will need to change. You could also write something more complex to first scan through all the data and get all the possible observation names, but I didn't want to put in that effort if it's not necessary.

This could be somewhat simplified if you used a dictionary instead of a list of tuples to hold the observation data for each row.

data = [[59000, 59500, 'chr1', 
    [('cn_04', '1.362352462'), ('cn_01', '1.802001235')]], 
    [100000, 110000, 'chr1', 
        [('cn_03', '1.887268908'), ('cn_02', '1.990457407'), ('cn_01', '4.302275763')]],
    [63500, 64000, 'chr1', 
        [('cn_03', '1.887268908'), ('cn_02', '1.990457407'), ('cn_01', '4.302275763')]]
  ]

def get_obs( num, obslist ):
  keyval = 'cn_0' + str(num)
  for obs in obslist:
    if obs[0] == keyval:
      return obs[1]
  return "."

for data_row in data:
  output_row = ""
  for obs in range(1,7):
    output_row += get_obs( obs, data_row[3] ) + '\t'
  output_row += str(data_row[0]) + '\t'
  output_row += str(data_row[1]) + '\t'
  output_row += str(data_row[2])
  print(output_row)
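
If the fixed 1-to-6 loop turns out to be too limiting, the "scan the data first" variant mentioned above might look roughly like this. It also converts each row's tuple list to a dictionary. The helper names collect_names and format_row are made up for this sketch, and sorting the names alphabetically is an assumption that only gives a sensible order while the names stay zero-padded (cn_01 ... cn_99).

```python
data = [
    [59000, 59500, 'chr1',
     [('cn_04', '1.362352462'), ('cn_01', '1.802001235')]],
    [100000, 110000, 'chr1',
     [('cn_03', '1.887268908'), ('cn_02', '1.990457407'), ('cn_01', '4.302275763')]],
    [63500, 64000, 'chr1',
     [('cn_03', '1.887268908'), ('cn_02', '1.990457407'), ('cn_01', '4.302275763')]],
]

def collect_names(rows):
    """Return every observation name that appears in any row, sorted."""
    names = set()
    for row in rows:
        for name, _value in row[3]:
            names.add(name)
    return sorted(names)

def format_row(row, names, missing='.'):
    """Build one tab-separated line; `missing` fills absent observations."""
    obs = dict(row[3])  # list of (name, value) tuples -> dict
    cells = [obs.get(name, missing) for name in names]
    cells += [str(row[0]), str(row[1]), row[2]]
    return '\t'.join(cells)

names = collect_names(data)
lines = [format_row(row, names) for row in data]

# Header line first, then one line per data row.
print('\t'.join(names + ['start', 'stop', 'chromosome']))
for line in lines:
    print(line)
```

Note that only names that actually occur in the data become columns, so cn_05 and cn_06 would not appear for the sample data above.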
Dave Costa
I love this answer! It looks beautiful, exactly what I needed. Thank you so much.
Jill Jo
A: 

Never use this kind of nested list/dictionary structure; it is not Pythonic and is very likely to lead to errors.

Instead, either use a class:

>>> class Gene:
       def __init__(self, start, end, chromosome, transcripts):
           self.start = start
           self.end = end
           self.chromosome = chromosome
           self.transcripts = transcripts
>>> gene1 = Gene(59000, 59500, 'chr1', [('cn_04', '1.362352462'), ('cn_01', '1.802001235')])
>>> gene2 = Gene(100000, 110000, 'chr1', [('cn_03', '1.887268908'), ('cn_02', '1.990457407'), ('cn_01', '4.302275763')])
>>> genes = [gene1, gene2, ...]
>>> gene1.start
59000
>>> genes[1].start
100000

or use numpy's record arrays and matrices.

To read and write CSV files you can use numpy's recarrays together with the helper functions imported below.

>>> from matplotlib.mlab import csv2rec, rec2csv  # note: removed in newer matplotlib releases
>>> import numpy as np
>>> d = np.array([(0, 10, 'chr1', [1, 2]), (20, 30, 'chr2', [1,2])], dtype=[('start', int), ('end', int), ('chromosome', 'S8'), ('transcripts', list)])

# all values in the 'chromosome' column
>>> d['chromosome']
array(['chr1', 'chr2'], 
      dtype='|S8')

# records in which chromosome == 1
>>> d[d['chromosome'] == 'chr1']   

# print first record
>>> d[0]
(0, 10, 'chr1', [1, 2])

# save it to a csv file:
>>> rec2csv(d, 'csvfile.txt', delimiter='\t')
dalloliogm
Your initial comment is nonsense. How are nested lists 'not Pythonic'? How is using a third-party library like numpy more Pythonic than using Python's built-in features?
Daniel Roseman
I said that because I know what the user wanted to ask and why. A few years ago I was in the same situation, and I can tell you that it is the wrong approach. In any case, the standard way to read and write CSV files is with the csv module, or with numpy's recarrays, which are an extension of that. Using a list of lists that way is not Pythonic; it is more Perl-ish, because Python has better data structures for these situations, and it also has objects.
dalloliogm
+2  A: 

For sending data to Excel, I would use CSV instead of a fixed-length text format; that way, if it turns out (say) that you need more significant figures in your float values, the format of your output doesn't change. Also, you can just open CSV files in Excel; you don't have to import them. And the csv.writer deals with all of the data-type conversion issues for you.

I'd also take advantage of the (apparent) fact that the 4th item in each observation appears to be a set of key/value pairs, which the dict function can turn into a dictionary. Assuming that you know what all of the keys are, you can specify the order that you want them to appear in your output simply by putting them in a list (called keys in the below code). Then it's simple to create an ordered list of values with a list comprehension. Thus:

>>> import sys
>>> import csv
>>> keys = ['cn_01', 'cn_02', 'cn_03', 'cn_04', 'cn_05', 'cn_06']
>>> data = [[59000, 59500, 'chr1', [('cn_04', '1.362352462'), ('cn_01', '1.802001235')]], [100000,   110000, 'chr1', [('cn_03', '1.887268908'), ('cn_02', '1.990457407'), ('cn_01', '4.302275763')]], [63500, 64000, 'chr1', [('cn_03', '1.887268908'), ('cn_02', '1.990457407'), ('cn_01', '4.302275763')]]]
>>> writer = csv.writer(sys.stdout)
>>> writer.writerow(keys + ['start', 'stop', 'chromosome'])
cn_01,cn_02,cn_03,cn_04,cn_05,cn_06,start,stop,chromosome
>>> for obs in data:
        d = dict(obs[3])
        row = [d.get(k, None) for k in keys] + obs[0:3]
        writer.writerow(row)

1.802001235,,,1.362352462,,,59000,59500,chr1
4.302275763,1.990457407,1.887268908,,,,100000,110000,chr1
4.302275763,1.990457407,1.887268908,,,,63500,64000,chr1

The above writes the data to sys.stdout; to create a real CSV file you'd do something like:

with open('file.csv', 'w', newline='') as f:  # newline='' (Python 3) avoids blank rows on Windows
    writer = csv.writer(f)
    # now use the writer to write out the data
Robert Rossney
A: 

You can also use xlwt to write .xls files directly, without touching Excel. More info.

Here is some sample code to get you started (far from perfect):

import xlwt as xl

def list2xls(data, fn=None, col_names=None, row_names=None):
    wb = xl.Workbook()
    ws = wb.add_sheet('output')
    # row 0 holds the column names, column 0 holds the row names
    if col_names:
        _write_1d_list_horz(ws, 0, 1, col_names)
    if row_names:
        _write_1d_list_vert(ws, 1, 0, row_names)
    _write_matrix(ws, 1, 1, data)
    if not fn:
        fn = 'test.xls'
    wb.save(fn)

def _write_matrix(ws, row_start, col_start, mat):
    for irow, row in enumerate(mat):
        _write_1d_list_horz(ws, irow + row_start, col_start, row)

def _write_1d_list_horz(ws, row, col, values):
    # write values left to right, starting at (row, col)
    for i, val in enumerate(values):
        ws.write(row, i + col, val)

def _write_1d_list_vert(ws, row, col, values):
    # write values top to bottom, starting at (row, col)
    for i, val in enumerate(values):
        ws.write(row + i, col, val)

Call list2xls, with data as a 2-d list, and optional column and row names as lists.
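
To get from the nested list in the question to the 2-d list that list2xls expects, something along these lines might work. build_table is a hypothetical helper of mine, the sample-name list is assumed to be known up front, and converting the values to float (so the spreadsheet stores numbers and not text) is my own choice.

```python
data = [
    [59000, 59500, 'chr1',
     [('cn_04', '1.362352462'), ('cn_01', '1.802001235')]],
    [100000, 110000, 'chr1',
     [('cn_03', '1.887268908'), ('cn_02', '1.990457407'), ('cn_01', '4.302275763')]],
]

def build_table(rows, sample_names, missing=''):
    """Return (col_names, matrix), where matrix is a 2-d list of cell values."""
    col_names = list(sample_names) + ['start', 'stop', 'chromosome']
    matrix = []
    for start, stop, chrom, samples in rows:
        values = dict(samples)  # list of (name, value) tuples -> dict
        # Numeric strings become floats; absent samples get the `missing` marker.
        cells = [float(values[n]) if n in values else missing
                 for n in sample_names]
        matrix.append(cells + [start, stop, chrom])
    return col_names, matrix

sample_names = ['cn_01', 'cn_02', 'cn_03', 'cn_04', 'cn_05', 'cn_06']
col_names, matrix = build_table(data, sample_names)

# With xlwt installed and list2xls defined as above, the call would then be:
# list2xls(matrix, fn='samples.xls', col_names=col_names)
```

The file name samples.xls is only a placeholder.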

nazca