I have a 100 GB text file, which is a BCP dump from a database. When I try to import it with BULK INSERT, I get a cryptic error on line number 219506324. Before solving this issue I would like to see this line, but alas my favorite method of

import linecache
print linecache.getline(filename, linenumber)

is throwing a MemoryError. Interestingly the manual says that "This function will never throw an exception." On this large file it throws one as I try to read line number 1, and I have about 6GB free RAM...

I would like to know what is the most elegant method to get to that unreachable line. Available tools are Python 2, Python 3 and C# 4 (Visual Studio 2010). Yes, I understand that I can always do something like

var line = 0;
using (var stream = new StreamReader(File.OpenRead(@"s:\source\transactions.dat")))
{
     while (++line < 219506324) stream.ReadLine(); //waste some cycles
     Console.WriteLine(stream.ReadLine());
}

Which would work, but I doubt it's the most elegant way.
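The rough Python equivalent of the same brute-force scan would presumably be something like this (just a sketch, not tested against the actual file), with itertools.islice doing the skipping:

import itertools

def get_line(filename, linenumber):
    # Read sequentially, but let islice skip the lines we don't care about;
    # returns '' if the file has fewer than linenumber lines.
    with open(filename) as f:
        return next(itertools.islice(f, linenumber - 1, linenumber), '')

print(get_line(r"s:\source\transactions.dat", 219506324))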

EDIT: I'm waiting to close this thread, because the hard drive containing the file is being used right now by another process. I'm going to test both suggested methods and report timings. Thank you all for your suggestions and comments.

The results are in: I implemented Gabe's and Alex's methods to see which one was faster. If I'm doing anything wrong, do tell. I'm going for the 10 millionth line in my 100 GB file, first using the method Gabe suggested and then the method Alex suggested, which I loosely translated into C#. The only thing I'm adding myself is first reading a 300 MB file into memory, just to clear the HDD cache.

const string file = @"x:\....dat"; // 100 GB file
const string otherFile = @"x:\....dat"; // 300 MB file
const int linenumber = 10000000;

ClearHDDCache(otherFile);
GabeMethod(file, linenumber);  //Gabe's method

ClearHDDCache(otherFile);
AlexMethod(file, linenumber);  //Alex's method

// Results
// Gabe's method: 8290 (ms)
// Alex's method: 13455 (ms)

The implementation of Gabe's method is as follows:

var gabe = new Stopwatch();
gabe.Start();
var data = File.ReadLines(file).ElementAt(linenumber - 1);
gabe.Stop();
Console.WriteLine("Gabe's method: {0} (ms)",  gabe.ElapsedMilliseconds);

While Alex's method is slightly trickier:

var alex = new Stopwatch();
alex.Start();
const int buffersize = 100 * 1024; //bytes
var buffer = new byte[buffersize];
var counter = 0;
using (var filestream = File.OpenRead(file))
{
    while (true) // Cutting corners here...
    {
        filestream.Read(buffer, 0, buffersize);
        //At this point we could probably launch an async read into the next chunk...
        var linesread = buffer.Count(b => b == 10); //10 is ASCII linebreak.
        if (counter + linesread >= linenumber) break;
        counter += linesread;
    }
}
//The downside of this method is that we have to assume that the line fits into the buffer, or do something clever...er
var data = new ASCIIEncoding().GetString(buffer).Split('\n').ElementAt(linenumber - counter - 1);
alex.Stop();
Console.WriteLine("Alex's method: {0} (ms)", alex.ElapsedMilliseconds);

So unless Alex cares to comment I'll mark Gabe's solution as accepted.

+5  A: 

Well, memory can run out at any time, asynchronously and unpredictably -- that's why the "never throws an exception" promise doesn't really apply there (just as in Java, where even though every method must specify which exceptions it can raise, some exceptions are exempt from that rule, since just about any method can raise them unpredictably at any time due to resource scarcity or other system-wide issues).

linecache tries to read the whole file. Your only simple alternative (hopefully you're not in a hurry) is to read one line at a time from the start...:

def readoneline(filepath, linenum):
    if linenum < 0: return ''
    with open(filepath) as f:
        for i, line in enumerate(f):
            if i == linenum: return line
        return ''

Here, linenum is 0-based (if you don't like that, and your Python is 2.6 or better, pass a starting value of 1 to enumerate), and the return value is the empty string for invalid line numbers.

Somewhat faster (and a lot more complicated) is to read, say, 100 MB at a time (in binary mode) into a buffer; count the number of line-ends in the buffer (just a .count('\n') call on the buffer string object); once the running total of line ends exceeds the linenum you're looking for, find the Nth line-end currently in the buffer (where N is the difference between linenum, here 1-based, and the previous running total of line ends), read a bit more if the N+1st line-end is not also in the buffer (as that's the point where your line ends), extract the relevant substring. Not just a couple of lines net of the with and returns for anomalous cases...;-).
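A rough, untested sketch of what that buffered approach might look like (1-based linenum, binary mode, carrying the partial line across buffer boundaries instead of re-reading; a line from a Windows-produced file will still carry its trailing '\r'):

def read_line_buffered(filepath, linenum, bufsize=100 * 1024 * 1024):
    # Return the 1-based linenum-th line (without its trailing '\n') as a
    # byte string, or an empty one if the file has fewer than linenum lines.
    seen = 0        # newlines counted in buffers already processed
    leftover = b''  # partial line carried over from the previous buffer
    with open(filepath, 'rb') as f:
        while True:
            buf = f.read(bufsize)
            if not buf:
                # EOF: the very last line may lack a trailing newline
                return leftover if (leftover and seen + 1 == linenum) else b''
            inbuf = buf.count(b'\n')
            if seen + inbuf >= linenum:
                # the wanted line's terminating newline is inside this buffer
                return (leftover + buf).split(b'\n')[linenum - seen - 1]
            seen += inbuf
            last = buf.rfind(b'\n')
            leftover = leftover + buf if last == -1 else buf[last + 1:]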

Edit: since the OP comments doubting that reading by-buffer instead of by-line can make a performance difference, I dug up an old piece of code where I was measuring the two approaches for a somewhat-related task -- counting the number of lines with the buffer approach, a loop on lines, or reading the whole file into memory in one gulp (by readlines). The target file is kjv.txt, the standard English text of the King James Version of the Bible, one line per verse, ASCII:

$ wc kjv.txt 
  114150  821108 4834378 kjv.txt

Platform is a Macbook Pro laptop, OS X 10.5.8, Intel Core 2 Duo at 2.4 GHz, Python 2.6.5.

The module for the test, readkjv.py:

def byline(fn='kjv.txt'):
    with open(fn) as f:
        for i, _ in enumerate(f):
            pass
    return i + 1

def byall(fn='kjv.txt'):
    with open(fn) as f:
        return len(f.readlines())

def bybuf(fn='kjv.txt', BS=100*1024):
    with open(fn, 'rb') as f:
        tot = 0
        while True:
            blk = f.read(BS)
            if not blk: return tot
            tot += blk.count('\n')

if __name__ == '__main__':
    print bybuf()
    print byline()
    print byall()

The prints are just to confirm correctness, of course (and they do ;-).

The measurement, of course after a few dry runs to ensure everybody's benefitting equally from the OS's, disk controller's, and filesystem's read-ahead functionality (if any):

$ py26 -mtimeit -s'import readkjv' 'readkjv.byall()'
10 loops, best of 3: 40.3 msec per loop
$ py26 -mtimeit -s'import readkjv' 'readkjv.byline()'
10 loops, best of 3: 39 msec per loop
$ py26 -mtimeit -s'import readkjv' 'readkjv.bybuf()'
10 loops, best of 3: 25.5 msec per loop

The numbers are quite repeatable. As you see, even on such a tiny file (less than 5 MB!), by-line approaches are slower than buffer-based ones -- just too much wasted effort!

To check scalability, I next used a 4-times-larger file, as follows:

$ cat kjv.txt kjv.txt kjv.txt kjv.txt >k4.txt
$ wc k4.txt
  456600 3284432 19337512 k4.txt
$ py26 -mtimeit -s'import readkjv' 'readkjv.bybuf()'
10 loops, best of 3: 25.4 msec per loop
$ py26 -mtimeit -s'import readkjv' 'readkjv.bybuf("k4.txt")'
10 loops, best of 3: 102 msec per loop

and, as predicted, the by-buffer approach scales just about exactly linearly. Extrapolating (always a risky endeavour, of course;-), a bit less than 200 MB per second seems to be the predictable performance -- call it 6 seconds per GB, or maybe 10 minutes for 100 GB.

Of course what this small program does is just line counting, but (once there is enough I/O to amortize constant overheads;-) a program to read a specific line should have similar performance (even though it needs more processing once it's found "the" buffer of interest, it's a roughly constant amount of processing, for a buffer of a given size -- presumably repeated halving of the buffer to identify a small-enough part of it, then a little bit of effort linear in the size of the multiply-halved "buffer remainder").

Elegant? Not really... but, for speed, pretty hard to beat!-)

Alex Martelli
Do you really think it would be that much faster to do a low-level line-end count? I'd think that either way would be about as slow as reading in the whole file from disk.
Gabe
When something that calls itself a "cache" proclaims that it will never throw an exception, it implies to me that it discards old members from its cache when it runs out of memory. Based on the docs for the function, I would *not* expect it to read all requested files into memory and never release them!
Gabe
@Gabe, I see what you mean wrt the docs, and I recommend you propose the specific patch in Python's docs that would have clarified the issue to you more sharply (see http://www.python.org/dev/patches/ ). Wrt speed, reading a file sequentially 100MB at a time is faster than reading it 100GB at a time unless you have more than 100GB of RAM available (as the latter course then either fails, or thrashes in paging); and it's faster than reading one line at a time as it avoids mucho parsing, object allocation, and so on, that you don't need, _and_ uses a larger buffer.
Alex Martelli
BTW, also see http://stackoverflow.com/questions/2985725/moving-to-an-arbitrary-position-in-a-file-in-python/2985778#2985778 (based on a slightly different case where lines from the files would need to be read repeatedly, so building an index was clearly useful) -- also, I'm pretty sure I did at least once post to SO several time measurements for this kind of buffering approach vs a line-by-line one, but I can't easily find the SO post where I did so, alas.
Alex Martelli
Alex: I just meant that since this process is likely to be I/O-bound, you're not likely to get much speedup by optimizing the line counting.
Gabe
@Gabe, as I said, I've measured -- not on _your_ 100-GB-file, your OS, your filesystem, your specific HW, etc, of course, which is why it's important that you do your own benchmarks. But intuition on performance is often misleading: carefully arranged measurement is a much more reliable guide.
Alex Martelli
I see someone submitted a patch a while back to improve linecache's handling of large files (http://bugs.python.org/issue1708). If someone wanted to review, update, and test it, that could improve the chances of it getting accepted for Python 3.
Ned Deily
It should be reading it in reasonably small blocks, eg. 64k at a time on demand, caching line-number to file-offset mappings to allow quick seeking by line number, and adjusting the distance between cached lines to avoid linearly increasing memory usage, so it works on huge files like this. That's certainly what I'd expect linecache to do based on the documentation, but it's not at all what it does. Looking at the source code, it's a very crude library that I'd suggest doesn't belong in the Python library at all. I even see thread-safety issues on casual inspection.
Glenn Maynard
Glenn: The linecache module was intended for caching source files for printing exception tracebacks, and for this it does quite well. You want it to cache the source code at the time it was parsed by Python, not at the time the exception happened. BTW, that's why this code isn't supposed to throw exceptions -- it's intended to be used by the exception code path.
Gabe
@Gabe, OK -- edited answer to show explicit buffering's vastly superior performance in a simple, easily repeatable benchmark case.
Alex Martelli
Alex, you just proved my point. :) Your buffering method is only 60% faster, and *only* if your disk can supply data at a staggering sustained 200MB/s! (1st gen SATA is only 150MB/s.) Since the general case is that you *cannot* saturate your disk interface for several minutes at a time, there's unlikely to be a measurable difference in the running time for the "faster" algorithm.
Gabe
Apart from comparing algorithm and buffering performance, it would be interesting to compare timing of the Python solution with the OP's C#/.Net solution.
Paul McGuire
@Alex, maybe there's a typo: "for i, line in enumerate(filepath):" -> "for i, line in enumerate(f):". By the way, enumerate (Python 2.6) has a start parameter for 1-based or whatever-based taste :)
sunqiang
@Gabe: A tangential but important nitpick: CPU use is still CPU use, whether or not it's the bottleneck for that particular operation. I don't want a program tying up a whole CPU core if it only needs 25% of one. Regardless of whether what *it's* doing will happen any faster as a result, it's still wasting resources that could be used elsewhere.
Glenn Maynard
@Alex Martelli, re "unless you have more than 100GB of RAM available": Actually, it's almost certainly faster to read in smaller chunks--say, 512k at a time--regardless of how much memory you have, because it'll keep the working set in L2 cache. Just as importantly, it means you're doing a very common operation instead of a very unusual one, which means you're going to be hitting much better tested and optimized code paths.
Glenn Maynard
@Glenn, have you noticed that, while I _measure_, you _blabber_? Get some real-world numbers in lieu of all of this blather, and the discussion will be _bilaterally_ meaningful (read Stroustrup's book on the design history of C++ about how his trusting badly "well-argued blabber" w/o supporting **numbers**, **data**, caused design errors that were revealed too late to fix them -- those who don't know history are doomed to repeat it...). Go data-driven, don't repeat the errors of history.
Alex Martelli
@sunqiang, you're right, thanks, will edit and fix, +1.
Alex Martelli
@Gabe, are you claiming that a 2-plus years old Mac laptop can deliver "staggering" disk performance?! I guess Steve Jobs would be tickled pink to read this - even his most unabashed fanboyz don't usually make _quite_ as extravagant claims wrt the performance of Macs!-) So, **measure** (ideally on a serious server, of course), for goodness sake...! Why am I usually the only one who can be bothered to verify claims with numbers?!-)
Alex Martelli
Alex: *Clearly* your disk is not capable of exceeding the bandwidth of its interface. The problem is that I'm claiming that scanning a 100GB file is an I/O-bound operation such that micro-optimizing the scanner won't matter, and you come back with a benchmark where you cache the complete file so you're not doing any I/O at all! I'm making an argument relating to disk speed, but your benchmark never hits the disk, so it's irrelevant.
Gabe
I've tested both methods and it would seem that Gabe's faster. Care to comment?
Gleno
@Alex Martelli: Thanks for reminding me (and everyone else present) why I don't respond to you; you're a useless, chronic flamer. EOF.
Glenn Maynard
Alex: If approach B is easier to write, easier to understand, easier to test, and easier to maintain, yet is only a few ms slower, why bother with approach A that might have some difficult corner case (like the line you want straddles multiple buffers). Will those few ms you save on every run ever add up to the extra time you spent coding A? Also, who's to say that the B program will even take longer to finish than the A program? I'd have to see a test on a sufficiently large file (i.e. none of it cached) before I believed that one was faster.
Gabe
+2  A: 

Here's my elegant version in C#:

Console.Write(File.ReadLines(@"s:\source\transactions.dat").ElementAt(219506323));

or more general:

Console.Write(File.ReadLines(filename).ElementAt(linenumber - 1));

Of course, you may want to show some context before and after the given line:

Console.Write(string.Join("\n",
              File.ReadLines(filename).Skip(linenumber - 5).Take(10)));

or more fluently:

File
.ReadLines(filename)
.Skip(linenumber - 5)
.Take(10)
.ToObservable()
.Subscribe(Console.WriteLine);

BTW, the linecache module does not do anything clever with large files. It just reads the whole thing in, keeping it all in memory. The only exceptions it catches are I/O-related (can't access file, file not found, etc.). Here's the important part of the code:

    fp = open(fullname, 'rU')
    lines = fp.readlines()
    fp.close()

In other words, it's trying to fit the whole 100GB file into 6GB of RAM! What the manual should perhaps say is "This function will never throw an exception if it can't access the file."

Gabe
+1, Console.Write(File.ReadLines(filename).ElementAt(linenumber - 1)); is certainly elegant. Considering all .NET file I/O is buffered, it should be efficient from a disk I/O perspective as well.
VinayC
This seems to be the fastest method, thanks!
Gleno
A: 

Not an elegant solution, but a faster one, would be to use multiple threads (or Tasks in .NET 4.0) to read and process multiple chunks of the file at the same time.

VinayC
It's unlikely that you can actually read in the data faster than you can process it, so trying to do it in parallel is pointless.
Gabe
@Gabe, I am actually suggesting reading the data in parallel as well (possible by seeking to a file position) - depending on the disk system, it may improve performance.
VinayC
I don't understand what you're suggesting. Since the only way to know what line number a specific byte is on is to have looked at every byte prior to that one, how is seeking going to help you?
Gabe
A simple solution outline would be: say 4 threads each read a block of, say, 16K at the same time and start counting the line breaks within it. Each thread updates a master count at the end (needs thread sync). You repeat this until you get near the desired line number, then run through the file sequentially.
VinayC
Multiple threads to handle linearly streaming a file? That's not only inelegant, it'll be slower, insanely more complicated and bug-prone. If there's any chance of parallel reads making things faster (eg. RAID), your arbitrarily-aligned interleaving of reads isn't going to line up with that. This is trivial streaming, the most basic and most heavily-optimized I/O case; don't turn it into something ridiculously complicated.
Glenn Maynard
A: 

If you expect to need this operation often on the same file, it would make sense to build an index.

You make an index by going through the whole file once and recording the positions of line beginnings, for example in a sqlite database. Then, when you need to go to a specific line, you query the index for it, seek to that position and read the line.
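A minimal, untested sketch of that in Python, using the standard-library sqlite3 module (the table layout and the 'lineindex.db' file name here are just made up for illustration):

import sqlite3

def build_index(datafile, indexfile='lineindex.db'):
    # One sequential pass over the data file, recording the byte offset at
    # which every (1-based) line starts.
    conn = sqlite3.connect(indexfile)
    conn.execute('CREATE TABLE IF NOT EXISTS lines (lineno INTEGER PRIMARY KEY, offset INTEGER)')
    offset, batch = 0, []
    with open(datafile, 'rb') as f:
        for lineno, line in enumerate(f, 1):
            batch.append((lineno, offset))
            offset += len(line)
            if len(batch) >= 100000:    # insert in batches to keep it fast
                conn.executemany('INSERT INTO lines VALUES (?, ?)', batch)
                conn.commit()
                batch = []
        if batch:
            conn.executemany('INSERT INTO lines VALUES (?, ?)', batch)
    conn.commit()
    conn.close()

def line_by_number(datafile, lineno, indexfile='lineindex.db'):
    # Look up the recorded offset, seek straight to it, read one line.
    conn = sqlite3.connect(indexfile)
    row = conn.execute('SELECT offset FROM lines WHERE lineno = ?', (lineno,)).fetchone()
    conn.close()
    if row is None:
        return b''
    with open(datafile, 'rb') as f:
        f.seek(row[0])
        return f.readline()

Building the index still costs one full sequential pass, but every later lookup is just a query plus a single seek.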

Toni Ruža