views:

390

answers:

8

A Python program I'm writing is to read a set number of lines from the top of a file, and the program needs to preserve this header for future use. Currently, I'm doing something similar to the following:

    header = ''
    header_len = 4
    for i in range(header_len):
        header += file_handle.readline()

Pylint complains that I'm not using the variable i. What would be a more pythonic way to do this?

Edit: The purpose of the program is to intelligently split the original file into smaller files, each of which contains the original header and a subset of the data. So, I need to read and preserve just the header before reading the rest of the file.

Edit 2: Changed the question title (from "Pythonic way to read a set number of lines from a file") since similar questions were coming up and apparently not getting referred to this one.

A: 

Maybe this:

header_len = 4
header = open("file.txt").readlines()[:header_len]

But it will be troublesome for long files.

mshsayem
.readlines() reads the entire file, though. If you have a large file and don't want to read the whole thing into memory, this could be a bad idea.
David Claridge
yeah, I have added that while you were writing this, ;)
mshsayem
if only readlines() were lazy!
David Claridge
@david: guido please make it lazy lazy very lazy... http://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python
TheMachineCharmer
There's no need, now that we have `itertools.islice`.
Robert Rossney
+1 for simplicity and OP can use the rest of the list items easily to split into smaller files. readlines() does read the entire file, but I am not going to -1 you for that, since we don't know if OP's files are that big in the GB range, so it might still be ok for OP to use this method.
Robert is right
mshsayem
+6  A: 

I'm not sure what the Pylint rules are, but you could use the '_' throwaway variable name.

header = ''
header_len = 4
for _ in range(header_len):
    header += file_handle.readline()
David Claridge
You don't need to use the for loop. I recommend a list comprehension (see my post below). Good call on the throwaway variable, though.
Arrieta
@Roger Pate: can you explain?
Arrieta
@Arrieta what is wrong with for loops?
@unknown, there's nothing wrong with using for loops. For loops are an integral part of Python and a basic concept of programming. If somebody tells you not to use them, tell them to take a hike.
You learn something new every day - I didn't know about the _ variable. Thanks! +1
GreenMatt
+9  A: 
import itertools

header_lines = list(itertools.islice(file_handle, header_len))
# or
header = "".join(itertools.islice(file_handle, header_len))

Note that with the first, the newline characters will still be present. To strip them:

header_lines = list(n.rstrip("\n")
                    for n in itertools.islice(file_handle, header_len))
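
As a minimal sketch of why this works for the "read the header, then process the rest" case: `islice` consumes only the lines it is asked for, so iteration over the same handle picks up right after the header. Here `io.StringIO` and the sample text are assumptions standing in for a real open file.

```python
import io
import itertools

# Stand-in for an open file; a real file object iterates the same way.
file_handle = io.StringIO("h1\nh2\nh3\nh4\ndata1\ndata2\n")
header_len = 4

# islice takes at most header_len lines and consumes no further lines.
header = "".join(itertools.islice(file_handle, header_len))

# Iterating the same handle resumes right after the header.
rest = list(file_handle)
```

If the file has fewer than `header_len` lines, `islice` simply returns what is there instead of raising an error.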
Roger Pate
If you strip the lines it will be difficult to recall the structure of the original header. I recommend you keep them.
Arrieta
No, it won't. In that example they are stored in a list rather than one long string. Which he should use depends on what he's doing with the data later.
Roger Pate
The OP writes in his script 'header += ...' so I think he meant a single string, but you are right: it depends.
Arrieta
itertools? What's wrong with for line in f?
Anurag Uniyal
This is COOL. +1
mshsayem
Arrieta: that's why I used separate header and header_lines variables.
Roger Pate
Anurag: your own answer doesn't even use "for line in f", nor do any of the answers I currently see iterate the file directly---if anything, itertools is the only solution here that uses the file as an iterator and is thus the closest answer to "for line in f".
Roger Pate
+1  A: 

My best answer is as follows:

file test.dat:

This is line 1
This is line 2
This is line 3
This is line 4
This is line 5
This is line 6
This is line 7
This is line 8
This is line 9

Python script:

f = open('test.dat')
nlines = 4
header = "".join(f.readline() for _ in range(nlines))

Output:

>>> header
'This is line 1\nThis is line 2\nThis is line 3\nThis is line 4\n'

Notice that you don't need to import any modules; also, you could use any dummy variable in place of _ (it works with i, or j, or ni, or whatever), but I recommend you don't (to avoid confusion). You could strip the newline characters (though I don't recommend it; keeping them lets you distinguish among lines) or do anything else that you can do with strings in Python.

Notice that I did not provide a mode for opening the file, so it defaults to "read only" - this is not Pythonic; in Python "explicit is better than implicit". Finally, nice people close their files; in this case it is automatic (because the script ends) but it is best practice to close them using f.close().
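
A minimal sketch of the more explicit variant, assuming a `with` block is acceptable: the mode is spelled out and the file is closed automatically when the block exits. The temporary file is only there to make the sketch self-contained; it is not part of the original script.

```python
import os
import tempfile

# Create a small sample file so the sketch runs on its own.
with tempfile.NamedTemporaryFile("w", suffix=".dat", delete=False) as tmp:
    tmp.write("".join("This is line %d\n" % n for n in range(1, 10)))
    path = tmp.name

nlines = 4
# Explicit read mode; the with block closes the file on exit.
with open(path, "r") as f:
    header = "".join(f.readline() for _ in range(nlines))

os.remove(path)  # clean up the sample file
```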

Happy Pythoning.

Edit: As pointed out by Roger Pate the square brackets are unnecessary in the list comprehension, thereby reducing the line by two characters. The original script has been edited to reflect this.

Arrieta
When you don't actually need a list and any iterable will work, such as the parameter to `"".join` here, then a generator expression is better, easier (by two keystrokes ;), and more clear than a list comprehension: `"".join(..)` instead of `"".join([..])`. They are related, and a LC is actually a special case of a genexp (in my view at least), where `[..]` is just convenience for `list(..)`. http://www.python.org/dev/peps/pep-0289/
Roger Pate
This is great - every day you learn: +1
Arrieta
close your file handle
@levislevis85: read the post
Arrieta
Yes, I did read it. I still want you to close it for the benefit of others who only want to see code and don't want to read.
@Arrieta: Did NASA approve your use of their logo? ;-p
GreenMatt
A: 
s = ""
f = open("file")
for n, line in enumerate(f):
    if n <= 3:
        s = s + line
    else:
        # do something here to process the rest of the lines
        pass
print(s)
f.close()
He seems to want the result in a single string (notice he writes header += ...)
Arrieta
By He I mean the OP
Arrieta
I think this implementation is overly complicated for such a simple task; it reads like C on Python - take advantage of the "Batteries Included" philosophy and use the existing methods on the objects.
Arrieta
overly complicated?? what criteria do you use to judge?? number of characters of code? number of lines of code?? Batteries included?? What kind of batteries are you talking about that i am not using? you can test my code versus your code with millions of lines, and they both perform on par. So what's the deal?
The "Batteries Included" is a motto of the Python Language (cf. website) "Fans of Python use the phrase "batteries included" to describe the standard library". What I mean is that your style is not taking advantage of the Standard Library and, by doing so, you are reinventing the wheel. This is not in line with Python's philosophy. By reinventing the wheel you condemn others to understand your logic (which could be difficult in some cases): by using the Standard Library you can express your ideas at a higher level of abstraction and don't distract your code logic with wheel reinventions.
Arrieta
No need in going around downvoting - this is a place to learn and you cannot get offended by people commenting on your code. If you cannot stand the heat, keep out of the kitchen.
Arrieta
I've programmed in Python since version 1.5 and I do know what "batteries included" means. So if I use the standard library, you would understand right away what I am writing? For example, itertools: older versions of Python may not have it. Also, for this simple task, there is no need to use it or other libraries. Sometimes, going down to the basics is still advantageous. If people don't understand what I write in my solution, I can only say they do not understand their basics. I can stand the heat, but not when the comments are ridiculous, based on subjective personal opinions, and not tackling the problem at hand.
Finally, I want to say that I use standard libraries when I need to. Other than that, as in this OP's case, it's so simple there's no need to use one.
A: 

I do not see anything wrong with your solution; maybe just replace i with _. I also do not like invoking itertools everywhere a simpler solution will work; it is like people using jQuery for trivial JavaScript tasks. Anyway, just to take revenge on itertools, here is my solution.

Since you want to read the whole file line by line anyway, why not first read the header and after that do whatever you want to do:

header = ''
header_len = 4

for i, line in enumerate(file_handle):
    if i < header_len:
        header += line
    else:
        # output chunks to separate files
        pass

print(header)
Anurag Uniyal
A: 

What about:

header = []
for i, line in enumerate(file_handle):
    if i <= 3:
        header.append(line)  # append the whole line; += would add it character by character
        continue
    # process the rest of the file here
Claudiu
+3  A: 
f = open('fname')
header = [next(f) for _ in range(header_len)]

Since you're going to write header back to the new files, you don't need to do anything with it. To write it back to the new file:

open('new', 'w').writelines(header + list_of_lines)

If you know the number of lines in the old file, list_of_lines would become:

list_of_lines = [next(f) for _ in range(chunk_len)]
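
A rough sketch of how the pieces could fit together for the splitting task, under some assumptions: `split_with_header` is a hypothetical helper, not part of the answer above, and it swaps `next(f)` for `itertools.islice` so that a short final chunk does not raise `StopIteration`. `io.StringIO` stands in for the real input file.

```python
import io
import itertools

def split_with_header(lines, header_len, chunk_len):
    """Yield lists of lines: the header plus up to chunk_len data lines."""
    lines = iter(lines)
    header = list(itertools.islice(lines, header_len))
    while True:
        chunk = list(itertools.islice(lines, chunk_len))
        if not chunk:
            break  # input exhausted
        yield header + chunk

# Stand-in for the original file: 2 header lines, 3 data lines.
source = io.StringIO("h1\nh2\nd1\nd2\nd3\n")
pieces = list(split_with_header(source, header_len=2, chunk_len=2))
```

Each element of `pieces` can then be passed to `writelines` on a separate output file; how those files are named is left to the caller.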
SilentGhost
Straightforward, easily understandable, and eliminates the pylint complaint. Thus it's the best answer, IMO.
GreenMatt