Hey guys, I'm new to Python and I have a problem. I have some measured data saved in a txt file. The data is separated with tabs and has this structure:

0   0 -11.007001 -14.222319 2.336769

I always have 32 data points per simulation (0, 1, 2, ..., 31) and 300 simulations (0, 1, 2, ..., 299), so the data is sorted first by simulation number and then by data point number.

The first column is the simulation number, the second column is the data point number and the other 3 columns are the x,y,z coordinates.

I would like to create a 3d array, the first dimension should be the simulation number, the second the number of the datapoint and the third the three coordinates.

I already started a bit and here is what I have so far:

## read file
coords = [x.split('\t') for x in
          open(f,'r').read().replace('\r','')[:-1].split('\n')]
## extract the information you want
simnum = [int(x[0]) for x in coords]
npts = [int(x[1]) for x in coords]
xyz = array([map(float,x[2:]) for x in coords])

But I don't know how to combine these two lists and this one array.

In the end I would like to have something like this:

array = [simnum][num_dat_point][xyz]

thanks for your help.

I hope you understand my problem, it's my first posting in a python forum, so if I did anything wrong, I'm sorry about this.

thanks again

+2  A: 

You can combine them with the zip function, like so:

# each row of xyz holds one (x, y, z) triple, so unpack it per row
for sim, datapoint, (x, y, z) in zip(simnum, npts, xyz):
    # do your thing

Or you could skip the intermediate lists altogether and just iterate over the lines of the file:

for line in open(fname):
    lst = line.split('\t')
    sim, datapoint = int(lst[0]), int(lst[1])
    x, y, z = [float(i) for i in lst[2:]]
    # do your thing

If you do want the coords list, you could (and should) build it by iterating over the file directly instead of reading and splitting the whole text by hand:

coords = [x.split('\t') for x in open(fname)]
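
A minimal sketch of how the line-by-line loop could fill the nested structure the question asks for. It assumes the rows are sorted by simulation and then by data point, as the question states; the helper name load_nested and the sample text are made up for illustration.

```python
import io

def load_nested(lines):
    """Return result[sim][point] == [x, y, z] for sorted, tab-separated rows."""
    result = []
    for line in lines:
        fields = line.split('\t')
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        sim = int(fields[0])
        xyz = [float(v) for v in fields[2:5]]
        if sim == len(result):
            result.append([])  # first row of a new simulation
        result[sim].append(xyz)
    return result

# Hypothetical sample: 2 simulations with 2 data points each.
sample = io.StringIO(
    "0\t0\t-11.007001\t-14.222319\t2.336769\n"
    "0\t1\t-10.5\t-14.0\t2.5\n"
    "1\t0\t-9.0\t-13.0\t3.0\n"
    "1\t1\t-8.5\t-12.5\t3.5\n"
)
nested = load_nested(sample)
```

With a real file, passing the open file object instead of the StringIO sample should behave the same way, and nested[sim][point] then gives the three coordinates.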
SilentGhost
You say your iterative example "[avoids] list comprehensions altogether", yet you use one in the second-to-last line.
JAB
And you don't see the difference between my use of list comprehensions and the OP's?
SilentGhost
@Cat: `map(float, lst[2:])` will do.
Roberto Bonvallet
@Roberto: Indeed. I was merely pointing out to SilentGhost that his code technically still had a list comprehension in it, so his wording was not completely correct.
JAB
The fileinput module could also be used in this situation. It would be interesting to see if it were faster (I'm 90% sure it would be).
Vince
+2  A: 

According to the Zen of Python, flat is better than nested. I'd just use a dict.

import csv
f = csv.reader(open('thefile.csv'), delimiter='\t',
               quoting=csv.QUOTE_NONNUMERIC)

result = {}
for simn, dpoint, c1, c2, c3 in f:
    result[simn, dpoint] = c1, c2, c3

# pretty-prints the result:
from pprint import pprint
pprint(result)
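
If integer keys matter, the conversion can also be done by hand rather than through csv.QUOTE_NONNUMERIC, which turns every unquoted field into a float. A small sketch, using a made-up in-memory sample in place of the real file:

```python
import csv
import io

# Hypothetical in-memory sample standing in for the tab-separated file.
sample = io.StringIO(
    "0\t0\t-11.007001\t-14.222319\t2.336769\n"
    "0\t1\t-10.5\t-14.0\t2.5\n"
)

result = {}
for simn, dpoint, c1, c2, c3 in csv.reader(sample, delimiter='\t'):
    # Convert explicitly: integer keys, float coordinates.
    result[int(simn), int(dpoint)] = (float(c1), float(c2), float(c3))
```

Here result[0, 1] is looked up with plain ints, so the keys match the simulation and data point numbers as written in the file.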
nosklo
You forgot to convert the data to int or float as appropriate.
John Machin
@John Machin: fixed
nosklo
Not fixed; the horrible QUOTE_NONNUMERIC gimmick implicitly converts **ALL** the OP's data to float.
John Machin
+2  A: 

This seems like a good opportunity to use itertools.groupby.

import itertools
import csv
file = open("data.txt")
reader = csv.reader(file, delimiter='\t')
result = []
for simnumberStr, rows in itertools.groupby(reader, key=lambda t: t[0]):
    simData = []
    for row in rows:
        simData.append([float(v) for v in row[2:]])
    result.append(simData)
file.close()

This will create a 3-dimensional list named 'result'. The first index is the simulation number, and the second index is the data index within that simulation. The value is a list of floats containing the x, y, and z coordinates.

Note that this assumes the data is already sorted on simulation number and data number.
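
To illustrate why the sorting assumption matters: groupby only merges consecutive rows with equal keys, so unsorted input has to be sorted first. A minimal sketch with made-up rows that are deliberately out of order:

```python
import itertools

# Hypothetical rows, deliberately out of order: (sim, point, x, y, z).
rows = [
    (1, 0, -9.0, -13.0, 3.0),
    (0, 1, -10.5, -14.0, 2.5),
    (0, 0, -11.007001, -14.222319, 2.336769),
    (1, 1, -8.5, -12.5, 3.5),
]

# Sort by simulation number, then by data point number.
rows.sort(key=lambda r: (r[0], r[1]))

# One inner list per simulation, one [x, y, z] list per data point.
result = [
    [list(r[2:]) for r in group]
    for _, group in itertools.groupby(rows, key=lambda r: r[0])
]
```

Without the sort, the rows for simulation 1 would end up in two separate groups.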

Greg
Greg - what version of Python are you using that the csv.reader function takes a delimeter argument? My Python 2.6 doesn't do that, was it added in Python 3?
Petriborg
Petriborg -- I'm using version Python 2.5.2. It's also documented here: http://docs.python.org/library/csv.html
Greg
@Petriborg, @Greg: It works better if you spell delimiter correctly.
jcdyer
@jcd -- thanks!
Greg
+1  A: 

Essentially, the difficulty is what happens if different simulations have different numbers of points.

You will therefore need to allocate an array of the appropriate size first. t should be an array of at least (max(simnum) + 1) x (max(npts) + 1) x 3, since the numbering starts at 0. To avoid confusion you should initialise it with not-a-number values; this will let you see missing points.

Then use something like

for x in coords:
  # each row is [sim, point, x, y, z], so the coordinates sit at indices 2..4
  t[int(x[0])][int(x[1])][0]=float(x[2])
  t[int(x[0])][int(x[1])][1]=float(x[3])
  t[int(x[0])][int(x[1])][2]=float(x[4])

is this what you meant?
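
A sketch of the NaN-initialised variant described above, using plain nested lists. The function name fill_table and the sample rows are assumptions made for illustration; the sizes would normally come from the data.

```python
import math

def fill_table(coords, n_sims, n_points):
    """Build t[sim][point] == [x, y, z], with NaN marking missing points."""
    nan = float('nan')
    t = [[[nan, nan, nan] for _ in range(n_points)] for _ in range(n_sims)]
    for row in coords:
        sim, point = int(row[0]), int(row[1])
        t[sim][point] = [float(v) for v in row[2:5]]
    return t

# Hypothetical rows as produced by splitting each line on tabs.
coords = [
    ['0', '0', '-11.007001', '-14.222319', '2.336769'],
    ['1', '1', '-8.5', '-12.5', '3.5'],
]
t = fill_table(coords, n_sims=2, n_points=2)

# Any slot still holding NaN was never written.
missing = [(s, p) for s in range(2) for p in range(2)
           if math.isnan(t[s][p][0])]
```

Checking for NaN afterwards shows which sim/point pairs never appeared in the file.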

Sanjay Manohar
every simulation has exactly the same number of points.
steffen
ok good. Then this code should work fine with, say, t=[[[0,0,0] for i in range(32)] for j in range(300)]
Sanjay Manohar
A: 

First I'd point out that your first data point appears to be an index, and wonder if the data is therefore important or not, but whichever :-)

import re

# Two integers followed by three signed decimals, separated by any whitespace.
LINE_RE = re.compile(r'^(\d+)\s+(\d+)\s+([-\d.]+)\s+([-\d.]+)\s+([-\d.]+)$')

def parse(line):
    m = LINE_RE.match(line)
    if m:
        l = m.groups()
        return (int(l[0]), int(l[1]), [float(v) for v in l[2:]])
    return None

finaldata = []
with open("data.txt", 'r') as file:
    for line in file:
        r = parse(line)
        if r is not None:
            finaldata.append(r)

Final data should have output along the lines of:

[(0, 0, [-11.007001000000001, -14.222319000000001, 2.3367689999999999]),
 (1, 0, [-11.007001000000001, -14.222319000000001, 2.3367689999999999]),
 (2, 0, [-11.007001000000001, -14.222319000000001, 2.3367689999999999]),
 (3, 0, [-11.007001000000001, -14.222319000000001, 2.3367689999999999]),
 (4, 0, [-11.007001000000001, -14.222319000000001, 2.3367689999999999])]

This should be pretty robust about dealing with whitespace issues (tabs, spaces, whatnot)...
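
For comparison, a sketch of the same whitespace tolerance without a regular expression: str.split() with no argument splits on any run of whitespace. The function name parse_line is made up for illustration.

```python
def parse_line(line):
    """Return (sim, point, [x, y, z]) or None if the line does not fit."""
    fields = line.split()  # splits on tabs, spaces and newlines alike
    if len(fields) != 5:
        return None
    try:
        return int(fields[0]), int(fields[1]), [float(v) for v in fields[2:]]
    except ValueError:
        return None  # a field was not numeric

row = parse_line("0 \t 0   -11.007001\t-14.222319 2.336769\n")
```

Lines with the wrong number of fields or non-numeric content come back as None, so they can be filtered the same way as in the loop above.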

I also wonder how big your data files are. Mine are usually large, so being able to process them in chunks or groups becomes more important... Anyway, this will work in Python 2.6.

Petriborg
+1  A: 

You could be using many different kinds of containers for your purposes, but none of them has array as an unqualified name -- Python has a module array which you can import from the standard library, but the array.array type is too limited for your purposes (1-D only and with elementary types as contents); there's a popular third-party extension known as numpy, which does have a powerful array type (numpy.ndarray, usually created with numpy.array), which you could use if you had downloaded and installed the extension -- but as you never even once mention numpy I doubt that's what you mean; the relevant builtin types are list and dict. I'll assume you want any container whatsoever -- but if you could learn to use precise terminology in the future, that will substantially help you AND anybody who's trying to help you (say list when you mean list, array only when you DO mean array, "container" when you're uncertain about what container to use, and so forth).

I suggest you look at the csv module in the standard library for a more robust way of reading your data, but that's a separate issue. Let's start from when you have the coords list of lists of 5 strings each, each sublist with strings representing two ints followed by three floats. Two more key aspects need to be specified...

One key aspect you don't tell us about: is the list sorted in some significant way? is there, in particular, some significant order you want to keep? As you don't even mention either issue, I will have to assume one way or another, and I'll assume that there isn't any guaranteed nor meaningful order; but, no repetition (each pair of simulation/datapoint numbers is not allowed to occur more than once).

Second key aspect: are there the same number of datapoints per simulation, in increasing order (0, 1, 2, ...), or is that not necessarily the case (and btw, are the simulations themselves numbered 0, 1, 2, ...)? Again, no clue from you on this indispensable part of the specs -- note how many assumptions you're forcing would-be helpers to make by just not telling us about such obviously crucial aspects. Don't let people who want to help you stumble in the dark: rather, learn to ask questions the smart way -- this will save untold amounts of time for yourself AND would-be helpers, and give you higher-quality and more relevant help, so, why not do it? Anyway, forced to make yet another assumption, I'll have to assume nothing at all is known about the simulation numbers nor about the numbers of datapoints in each simulation.

With these assumptions dict emerges as the only sensible structure to use for the outer container: a dictionary whose key is a tuple with two items, simulation number then datapoint number within the simulation. The values may as well be tuple, too (with three floats each), since it does appear that you have exactly 3 coordinates per line.

With all of these assumptions...:

def make_container(coords):
  result = dict()
  for s, d, x, y, z in coords:
    key = int(s), int(d)
    value = float(x), float(y), float(z)
    result[key] = value
  return result

It's always best, and fastest, to have all significant code within def statements (i.e. as functions to be called, possibly with appropriate arguments), so I'm presenting it this way. make_container returns a dictionary which you can address with the simulation number and datapoint number; for example,

d = make_container(coords)
print d[0, 0]

will print the x, y, z for dp 0 of sim 0, assuming one exists (you would get an error if such a sim/dp combination did not exist). dicts have many useful methods, e.g. changing the print statement above to

print d.get((0, 0))

(yes, you do need double parentheses here -- inner ones to make a tuple, outer ones to call get with that tuple as its single argument), you'd see None, rather than get an exception, if there was no such sim/dp combination as (0, 0).
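
If the numbering does turn out to be contiguous from 0, a dict of this shape can be turned into a nested list afterwards. A minimal sketch, with a hand-written dict standing in for the return value of make_container:

```python
# Hypothetical container in the shape make_container returns.
d = {
    (0, 0): (-11.007001, -14.222319, 2.336769),
    (0, 1): (-10.5, -14.0, 2.5),
    (1, 0): (-9.0, -13.0, 3.0),
    (1, 1): (-8.5, -12.5, 3.5),
}

# Sizes are derived from the largest keys, assuming numbering starts at 0.
n_sims = max(s for s, _ in d) + 1
n_points = max(p for _, p in d) + 1

# d.get returns None for a missing pair instead of raising KeyError.
nested = [[d.get((s, p)) for p in range(n_points)] for s in range(n_sims)]
```

After this, nested[sim][point] gives the same tuple as d[sim, point], and any missing combination shows up as None.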

If you can edit your question to make your specs more precise (perhaps including some indication of ways you plan to use the resulting container, as well as the various key aspects I've listed above), I might well be able to fine-tune this advice to match your need and circumstances much better (and so might every other responder, regarding their own advice!), so I strongly recommend you do so -- thanks in advance for helping us help you!-)

Alex Martelli
I clarified my post. I always have 32 data points per simulation (0, 1, 2, ..., 31) and 300 simulations (0, 1, 2, ..., 299), so the data is sorted first by simulation number and then by data point number. I hope that helps, and thanks for the help, it is really useful. Sorry for my unsmart way of posting; I'll try to do better in the future. Thanks again.
steffen
NP, and I see you've accepted an answer already so this must mean that answer has solved your problem, so I'm glad for you!
Alex Martelli
A: 

Are you sure a 3d array is what you want? It seems more likely that you want a 2d array, where the simulation number is one dimension, the data point is the second, and then the value stored at that location is the coordinates.

This code will give you that.

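data = {}  # nested dicts, since a list cannot be indexed before it is filled
for coord in coords:
    sim, point = int(coord[0]), int(coord[1])
    if sim not in data:
        data[sim] = {}
    data[sim][point] = (float(coord[2]), float(coord[3]), float(coord[4]))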

To get the coordinates at simulation 7, data point 13, just do data[7][13].

Michael Fairley