views:

90

answers:

3

What's a good algorithm to shuffle the lines in a file without reading the whole file in advance?

I guess it would look something like this: start reading the file line by line from the beginning, storing each line as you go, and at each point decide whether to print one of the lines stored so far (and then remove it from storage) or do nothing and proceed to the next line.

Can someone verify / prove this and/or maybe post working (perl, python, etc.) code?

Related questions, but not looking at memory-efficient algorithms:

+5  A: 

I cannot think of a way to shuffle the entire file randomly without somehow maintaining a list of what has already been written. If I had to do a memory-efficient shuffle, I would scan the file, building a list of offsets for the newlines. Once I have this list of line offsets, I would repeatedly pick one at random, write the corresponding line to stdout, and remove it from the list.

I am not familiar with Perl or Python, but I can demonstrate with PHP.

<?php
$offsets = array();

$size = filesize("file.txt");
$f = fopen("file.txt", "r");
$offsets[] = ftell($f);  // the first line starts at offset 0
while (! feof($f))
{
  // each newline marks the start of the next line,
  // unless it is the very last character of the file
  if (fgetc($f) == "\n" && ftell($f) < $size) $offsets[] = ftell($f);
}

shuffle($offsets);  // randomize the order in which line offsets are visited
foreach ($offsets as $offset)
{
  fseek($f, $offset);
  echo fgets($f);  // print the single line starting at this offset
}
fclose($f);
?>

The only other option I can think of, if scanning the file for newlines is absolutely unacceptable, would be (I am not going to code this one out):

  1. Determine the filesize
  2. Create a list of offsets and lengths already written to stdout
  3. Loop until bytes_written == filesize
  4. Seek to a random offset that is not already in your list of already written values
  5. Back up from that seek to the previous newline or start of file
  6. Display that line, and add it to the list of offsets and lengths written
  7. Go to 3.
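The steps above can be sketched in Python (my own rough sketch, not the answerer's code; the function name `shuffle_lines` and the retry-on-collision strategy in step 4 are my assumptions):

```python
import os
import random

def shuffle_lines(path):
    # Sketch of the seek-based scheme above: pick a random byte offset,
    # back up to the start of the containing line, and emit each line once.
    size = os.path.getsize(path)
    written = set()   # line-start offsets already emitted (step 2's list)
    emitted = 0       # total bytes emitted so far (step 3's condition)
    out = []
    with open(path, "rb") as f:
        while emitted < size:
            pos = random.randrange(size)          # step 4: random offset
            # step 5: back up to the character after the previous newline
            while pos > 0:
                f.seek(pos - 1)
                if f.read(1) == b"\n":
                    break
                pos -= 1
            if pos in written:
                continue                          # already emitted; retry
            f.seek(pos)
            line = f.readline()
            written.add(pos)                      # step 6: record it
            emitted += len(line)
            out.append(line)
    return out
```

Note that retries become frequent once most lines have been written, so this gets slow toward the end; longer lines are also more likely to be picked early, so the result is not uniformly random.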
Brandon Horsley
That sounds like a good sol'n to me +1
Mark
+1  A: 

The following algorithm is linear in the number of lines in your input file.

Preprocessing:

  1. Find n (the total number of lines) by scanning for newlines (or whatever the line terminator is), and record the character offsets of the beginning and end of each line. So you'd have two vectors, say, s and e of size n, where the characters from s[i] to e[i] in your input file make up the i-th line. In C++ I'd use std::vector.

  2. Randomly permute a vector of integers from 1 to n (in C++ it would be random_shuffle) and store it in a vector, say, p (e.g. 1 2 3 4 becomes p = [3 1 4 2]). This implies that line i of the new file is now line p[i] in the original file (i.e. in the above example, the 1st line of the new file is the 3rd line of the original file).

Main

  1. Create a new file

  2. Write the first line in the new file by reading the text in the original file between s[p[0]] and e[p[0]] and appending it to the new file.

  3. Continue as in step 2 for all the other lines.

So the overall complexity is linear in the number of lines (since random_shuffle is linear) if you assume read/write & seeking in a file (incrementing the file pointer) are all constant time operations.
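The two passes above can be sketched in Python (my own translation of the C++ description; the function name `permute_file` and the starts/lengths representation of the s and e vectors are my assumptions):

```python
import random

def permute_file(src_path, dst_path):
    # Preprocessing: record each line's start offset and length
    # (this plays the role of the s and e vectors above).
    starts, lengths = [], []
    offset = 0
    with open(src_path, "rb") as f:
        for line in f:
            starts.append(offset)
            lengths.append(len(line))
            offset += len(line)
    # Random permutation of the line indices; random.shuffle is
    # Python's counterpart of C++'s random_shuffle.
    order = list(range(len(starts)))
    random.shuffle(order)
    # Main: copy the lines into the new file in permuted order.
    with open(src_path, "rb") as f, open(dst_path, "wb") as out:
        for i in order:
            f.seek(starts[i])
            out.write(f.read(lengths[i]))
```

Both passes are linear in the number of lines, matching the complexity claim above, though the permuted copy does one seek per line.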

Jacob
A: 

You can create an array of N strings and read the first N lines of the file into it. For each remaining line, pick a random index into the array, write the string at that index to the output file, and replace it with the newly read line. When the input is exhausted, write out the strings still left in the array. This has the advantage that you don't need to iterate over the file twice. The disadvantage is that it will not create a very random output file, especially when N is low (for example, this algorithm can't move the last line more than N lines up in the output).

Edit

Just an example in python:

import random
import sys

CACHE_SIZE = 23

lines = {}  # slot index -> buffered line

for l in sys.stdin:  # e.g. run as: python shuffle.py < file.txt
    i = random.randrange(CACHE_SIZE)
    old = lines.get(i)
    if old is not None:
        print(old, end='')  # evict whatever line occupied this slot
    lines[i] = l

# at end of input, flush the lines still held in the buffer
for p in lines.values():
    print(p, end='')
Rudi