Can I reset an iterator / generator in Python? I am using DictReader and would like to reset it (from the csv module) to the beginning of the file.
thanks.
Can I reset an iterator / generator in Python? I am using DictReader and would like to reset it (from the csv module) to the beginning of the file.
thanks.
Only if the underlying type provides a mechanism for doing so (e.g. fp.seek(0)
).
No. Python's iterator protocol is very simple, and only provides one single method (.next()
or __next__()
), and no method to reset an iterator in general.
The common pattern is to instead create a new iterator using the same procedure again.
If you want to "save off" an iterator so that you can go back to its beginning, you may also fork the iterator by using itertools.tee
If you have a csv file named 'blah.csv' That looks like
a,b,c,d
1,2,3,4
2,3,4,5
3,4,5,6
you know that you can open the file for reading, and create a DictReader with
blah = open('blah.csv', 'r')
reader= csv.DictReader(blah)
Then, you will be able to get the next line with reader.next(), which should output
{'a':1,'b':2,'c':3,'d':4}
using it again will produce
{'a':2,'b':3,'c':4,'d':5}
However, at this point if you use blah.seek(0), the next time you call reader.next() you will get
{'a':1,'b':2,'c':3,'d':4}
again. This seems to be the functionality you're looking for. I'm sure there are some tricks associated with this approach that I'm not aware of however. @Brian suggested simply creating another DictReader. This won't work if you're first reader is half way through reading the file, as your new reader will have unexpected keys and values from wherever you are in the file.
While there is no iterator reset, the "itertools" module from python 2.6 (and later) has some utilities that can help there. One of then is the "tee" which can make multiple copies of an iterator, and cache the results of the one running ahead, so that these results are used on the copies. I will seve your purposes:
>>> def printiter(n):
... for i in xrange(n):
... print "iterating value %d" % i
... yield i
>>> from itertools import tee
>>> a, b = tee(printiter(5), 2)
>>> list(a)
iterating value 0
iterating value 1
iterating value 2
iterating value 3
iterating value 4
[0, 1, 2, 3, 4]
>>> list(b)
[0, 1, 2, 3, 4]
I see many answers suggesting itertools.tee, but that's ignoring one crucial warning in the docs for it:
This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use
list()
instead oftee()
.
Basically, tee
is designed for those situation where two (or more) clones of one iterator, while "getting out of sync" with each other, don't do so by much -- rather, they say in the same "vicinity" (a few items behind or ahead of each other). Not suitable for the OP's problem of "redo from the start".
L = list(DictReader(...))
on the other hand is perfectly suitable, as long as the list of dicts can fit comfortably in memory. A new "iterator from the start" (very lightweight and low-overhead) can be made at any time with iter(L)
, and used in part or in whole without affecting new or existing ones; other access patterns are also easily available.
As several answers rightly remarked, in the specific case of csv
you can also .seek(0)
the underlying file object (a rather special case). I'm not sure that's documented and guaranteed, though it does currently work; it would probably be worth considering only for truly huge csv files, in which the list
I recommmend as the general approach would have too large a memory footprint.
There's a bug in using .seek(0) as advocated by Alex Martelli and Wilduck above, namely that the next call to .next() will give you a dictionary of your header row in the form of {key1:key1, key2:key2, ...}. The work around is to follow file.seek(0) with a call to reader.next() to get rid of the header row.
So your code would look something like this:
f_in = open('myfile.csv','r')
reader = csv.DictReader(f_in)
for record in reader:
if some_condition:
# reset reader to first row of data on 2nd line of file
f_in.seek(0)
reader.next()
continue
do_something(record)
This is maybe considered not answering question, but for me first question is: Am I asking the right question?
For me the csv files are mostly readable without csv, at least when not using quotes etc.
You can use instead of csv module simple code snippet like this mine which I posted in DaniWeb: text file based information access by field name and number
Then you can access the info any way and direction you like.
Second thing is that, I am interested to see exactly the reason arise for this activity. Maybe the program logic is not the best possible and the need is just reflection of the overlooked thing? Has happened to me at least, that I had to re-examine my pseudo code and correct it instead of doing strange program.