ansaurus

Question

How to shuffle the lines in a file without reading the whole file in advance?

Answer 1

+5 A:

I cannot think of a way to randomly do the entire file without somehow maintaining a list of what has already been written. I think if I had to do a memory efficient shuffle, I would scan the file, building a list of offsets for the new lines. Once I have this list of new line offsets, I would randomly pick one of them, write it to stdout, and then remove it from the list of offsets.

I am not familiar with perl, or python, but can demonstrate with php.

<?php
$offsets = array();

$f = fopen("file.txt", "r");
$offsets[] = ftell($f);
while (! feof($f))
{
  if (fgetc($f) == "\n") $offsets[] = ftell($f);
}

shuffle($offsets);
foreach ($offsets as $offset)
{
  fseek($f, $offset);
  echo fgets($f);
}
fclose($f);
?>

The only other option I can think of, if scanning the file for new lines is absolutely unacceptable, would be (I am not going to code this one out):

Determine the filesize
Create a list of offsets and lengths already written to stdout
Loop until bytes_written == filesize
Seek to a random offset that is not already in your list of already written values
Back up from that seek to the previous newline or start of file
Display that line, and add it to the list of offsets and lengths written
Go to 3.

Brandon Horsley 2010-07-30 00:02:52

That sounds like a good sol'n to me +1

Mark 2010-07-30 00:42:52

Answer 2

+1 A:

The following algorithm is linear in the number of lines in your input file.

Preprocessing:

Find n (the total number of lines) by scanning for newlines (or whatever) but store the character number signifying the beginning and end of each line. So you'd have 2 vectors, say, s and e of size n where characters numbering from s[i] to e[i] in your input file is the i th line. In C++ I'd use vector.
Randomly permute a vector of integers from 1 to n (in C++ it would be random_shuffle) and store it in a vector, say, p (e.g. 1 2 3 4 becomes p = [3 1 4 2]). This implies that line i of the new file is now line p[i] in the original file (i.e. in the above example, the 1st line of the new file is the 3rd line of the original file).

Main

Create a new file
Write the first line in the new file by reading the text in the original file between s[p[0]] and e[p[0]] and appending it to the new file.
Continue as in step 2 for all the other lines.

So the overall complexity is linear in the number of lines (since random_shuffle is linear) if you assume read/write & seeking in a file (incrementing the file pointer) are all constant time operations.

Jacob 2010-07-30 00:18:16

Answer 3

A:

You can create an array for N strings and read the first N lines of the file into this array. For the rest you read one line, select by random one of the lines from the array, and replace the this string by the newly read string. Also you write out the string from the array to the output file. This has the advantage that you don't need to iterate over the file twice. The disadvantage is that it will not create a very random output file, especially when N is low (for example this algorithm can't move the last line more than N lines up in the output.)

Edit

Just an example in python:

import sys
import random

CACHE_SIZE = 23

lines = {}

for l in sys.stdin: # you can replace sys.stdin with xrange(200) to get a test output
    i = random.randint(0, CACHE_SIZE-1)
    old = lines.get(i)
    if old:
        print old,
    lines[i] = l

for ignored, p in lines.iteritems():
    print p,

Rudi 2010-07-30 12:23:45

ansaurus

tags:

views:

answers:

How to shuffle the lines in a file without reading the whole file in advance?

Preprocessing:

Main

related questions