tags:

views: 290

answers: 6

How can I get a particular line from a 3 GB text file? The lines are delimited by \n, and I need to be able to fetch any line on demand.

How can this be done? Only one line needs to be returned.

Update: If it matters, all the lines have the same length.

+2  A: 

If it's not a fixed-record-length file and you don't do some sort of indexing on the line starts, your best bet is to just use:

head -n N filespec | tail -1

where N is the line number you want.

This isn't going to be the best-performing approach for a 3 GB file, unfortunately, but there are ways to make it better.

If the file doesn't change too often, you may want to consider indexing it. By that I mean having another file with the line offsets in it as fixed-length records.

So the file:

0000000000
0000000017
0000000092
0000001023

would give you a fast way to locate each line: just multiply the desired line number by the index record size and seek there in the index file.

Then use the value at that location to seek in the main file so you can read until the next newline character.

So for line 3, you would seek to 22 in the index file (the index record length is 10 characters plus one more for the newline, and line 3 is two records in). Reading the value there, 0000000092, gives you the offset to use in the main file.

Of course, that's not so useful if the file changes frequently, although if you can control what happens when things get appended, you can still add offsets to the index efficiently. If you don't control that, you'll have to re-index whenever the last-modified date of the index is earlier than that of the main file.
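The indexing scheme described above can be sketched in Python (a hypothetical illustration, not code from the answer; `build_index` and `get_line` are made-up names, and the 10-digits-plus-newline record format follows the example index):

```python
REC_LEN = 11  # 10 digits plus a newline, as in the example index above

def build_index(data_path, index_path):
    """Write the byte offset of each line start as a fixed-width record."""
    with open(data_path, 'rb') as data, open(index_path, 'w') as idx:
        offset = 0
        for line in data:
            idx.write('%010d\n' % offset)
            offset += len(line)

def get_line(data_path, index_path, lineno):
    """Fetch 1-based line `lineno` via two seeks: one in the index, one in the data."""
    with open(index_path, 'r') as idx:
        idx.seek((lineno - 1) * REC_LEN)
        offset = int(idx.read(REC_LEN))
    with open(data_path, 'rb') as data:
        data.seek(offset)
        return data.readline()
```

Building the index is one sequential pass; every lookup afterwards costs two small seeks regardless of file size.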


And, based on your update:

Update: If it matters, all the lines have the same length.

With that extra piece of information, you don't need the index - you can seek immediately to the right location in the main file by multiplying the record length by the zero-based line number (assuming the values fit into your data types).

So, as runnable Python rather than pseudo-code:

def getline(fhandle, reclen, recnum):
    # recnum is zero-based; reclen includes the newline
    fhandle.seek(reclen * recnum)
    return fhandle.read(reclen)
paxdiablo
camh has a better solution, but I'll leave this one here for the case where the records aren't fixed length.
paxdiablo
Thanks for this, very informative.
JavaRocky
+4  A: 

head -10 file | tail -1 returns line 10, though it will probably be slow.

from here

# print line number 52 
sed -n '52p' # method 1 
sed '52!d' # method 2 
sed '52q;d' # method 3, efficient on large files
Paul Creasey
+9  A: 

If all the lines have the same length, the best way by far will be to use dd(1) and give it a skip parameter.

Let the block size be the length of each line (including the newline), then you can do:

$ dd if=filename bs=<line-length> skip=<line_no - 1> count=1 2>/dev/null

The idea is to seek past all the previous lines (skip=<line_no - 1>) and read a single line (count=1). Because the block size is set to the line length (bs=<line-length>), each block is effectively a single line. Redirect stderr so you don't get the annoying stats at the end.

That should be much more efficient than streaming all the preceding lines through a program just to throw them away, as dd will seek directly to the position you want and read only one line of data from the file.
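For comparison, the same single-seek idea can be sketched in Python (an illustrative translation of the dd approach, not part of camh's answer; `read_fixed_line` is a made-up name, and `line_len` includes the newline):

```python
def read_fixed_line(path, line_len, lineno):
    """Read 1-based line `lineno` from a file whose lines all have length `line_len`.

    Mirrors `dd bs=<line-length> skip=<line_no - 1> count=1`:
    seek straight past the preceding lines, then read exactly one record.
    """
    with open(path, 'rb') as f:
        f.seek(line_len * (lineno - 1))
        return f.read(line_len)
```

Like dd, this touches only one record's worth of data no matter how large the file is.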

camh
+1. Basically the same as my later solution after the extra fixed-record-size snippet was added to the question, but has the distinct advantage of not needing to write your own program.
paxdiablo
That's so nerdy. Heh. dd, i like it.
JavaRocky
A: 

An awk alternative, where 3 is the line number.

awk 'NR == 3 {print; exit}' file.txt
Jamie
better to print and exit, so awk doesn't go through the rest of the file.
ghostdog74
Very good point
Jamie
+1  A: 

A quick Perl one-liner would work well for this too...

$ perl -ne 'if (YOURLINENUMBER..YOURLINENUMBER) {print $_; last;}' /path/to/your/file
Eld
A: 

Use q with sed to make the search stop after the line has been printed.

sed -n '11723{p;q}' filename

Python (minimal error checking):

#!/usr/bin/env python
import sys

# by Dennis Williamson - 2010-05-08
# for http://stackoverflow.com/questions/2794049/getting-one-line-in-a-huge-file-with-bash

# seeks the requested line in a file with a fixed line length

# Usage: ./lineseek.py LINE FILE

# Example: ./lineseek.py 11723 data.txt

EXIT_SUCCESS      = 0
EXIT_NOT_FOUND    = 1
EXIT_OPT_ERR      = 2
EXIT_FILE_ERR     = 3
EXIT_DATA_ERR     = 4

# could use a try block here
seekline = int(sys.argv[1])

filename = sys.argv[2]

try:
    if filename == '-':
        handle = sys.stdin      # note: the seek() below won't work on a pipe
    else:
        handle = open(filename, 'r')
except IOError:
    print("File Open Error", file=sys.stderr)
    sys.exit(EXIT_FILE_ERR)

try:
    line = handle.readline()
    lineend = handle.tell()
    linelen = len(line)
except IOError:
    print("File I/O Error", file=sys.stderr)
    sys.exit(EXIT_FILE_ERR)

# it would be really weird if this happened
if lineend != linelen:
    print("Line length inconsistent", file=sys.stderr)
    sys.exit(EXIT_DATA_ERR)

handle.seek(linelen * (seekline - 1))

try:
    line = handle.readline()
except IOError:
    print("File I/O Error", file=sys.stderr)
    sys.exit(EXIT_FILE_ERR)

if len(line) != linelen:
    print("Line length inconsistent", file=sys.stderr)
    sys.exit(EXIT_DATA_ERR)

print(line, end='')

Argument validation should be a lot better and there is room for many other improvements.

Dennis Williamson