ansaurus

Question

How can I return a random line from a file? Interview Question

Answer 1

+1 A:

Re 1: Use solution to 2

Re 2: You would want to scan the whole file using a RandomAccessFile access to count the number of lines and (possibly) cache the file pointers for each start of line. Then you could choose one at random (I'm assuming this question is not about how to generate random numbers) and move back to that start point, read the line and return it. If you want it fast then make sure you are buffering the reads (raf is v slow otherwise).

Re 3: If the stream doesn't fit in memory (i.e. you cannot cache the whole thing) and you don't know how many lines are in the stream without reading the whole stream (assuming you only get to read it once) then I cannot see a solution. I too wait for answers...

DaveC 2010-01-06 21:14:56

you can do it without knowing the number of lines and without reading all the lines in memory. See my answer for details.

Alok 2010-01-06 23:08:27

Answer 2

+20 A:

Read all lines into an array, return a random line in the range of 1 and the amount of lines.
Most simple: Count the lines, choose a line number at random, go through the file a second time and return it.

You just have to remember one line. Each new line has a probability of 1/N (N being lines read).

Pseudocode:

i = 1
chosen_line = ""
for line in lines:
    if random() < 1/i:    # random returns a uniform random number in [0,1)
        chosen_line = line
    i += 1
return chosen_line

Algorithm number 3 could be used for 1 and 2 too.

Georg 2010-01-06 21:15:12

Your solution #3 is correct but perhaps a bit confusing... to clarify, at each line you read, the chance that you should choose the new line will be 1/N where N is the number of lines you've read. Saying to "choose" (1,2,3) for example is unnecessary and (IMO) confusing. Just keep track of which line you chose last, and update percentages as you go. +1.

Michael Bray 2010-01-06 21:22:15

@dreamlax: that's the point of my last comment... you ONLY have to keep track of the one line that you've chosen, and each new line that you read will have a 1/N chance of replacing that line. N is the number of lines read "so far", not the total number of lines in the file.

Michael Bray 2010-01-06 21:23:47

1: No real sense reading in the whole file into an array if you only want one line.2: Or better yet (though less random, probably): fstat() to get the file size, pick a random point, and read forward/backward from that point until you have a full line of text.

KingRadical 2010-01-06 21:23:57

Solution 3 works and is the same that I came up with. It's apparently a well known algorithm too. It does *not* require the entire file in memory, just the most recently selected line. See http://stackoverflow.com/questions/232237/whats-the-best-way-to-return-a-random-line-in-a-text-file-using-c

Mark Ransom 2010-01-06 21:24:27

@KingRadical: that depends... on how big the file is, how fast your system is, etc etc... reading the entire file at once probably maximizes the efficiency of disk thruput, if that turns out to be your bottleneck.

Michael Bray 2010-01-06 21:25:28

@Michael Bray: Sorry, I deleted my comment, I just realised after I re-read it.

dreamlax 2010-01-06 21:34:55

I don't think you can argue much about question 1 and 3, they are quite clear. The real question is if there's a better approach to 2.

Georg 2010-01-06 21:36:39

gs: small comment error in your code... random() should NOT return 1 - it must be 0 (inclusive) to 1 (non-inclusive) otherwise there is a chance (however small) that a one-line file wouldn't select any line. either that or change < to <=

Michael Bray 2010-01-06 21:43:06

@gs: on whether there is a better approach to #2... I think within the context of the question, the fundamental difference between #2 and #3 is that #3 is implying a forward-only approach, meaning you can't re-read the data, which is why the algorithm provided is necessary. To me, this implies that you should utilize that extra capability in #2. Thus, I think your answer for #2 is about as good as there is.

Michael Bray 2010-01-06 21:45:40

The real problem I see with #2, which is a fairly common optimization problem, is that then you are iterating over a data set twice when you really do not need to. The actual time difference between that and an algorithm that only iterates once, or does not iterate over the data set at all may be negligible for small files, but it is inherently less efficient.

KingRadical 2010-01-06 21:49:25

To me it looks like those are iterative questions. The first one is super-simple, then the interviewee has to think about the second one, the first thing he'll think of is looping twice over it. The third is deliberately not solvable with the approach used in the second. My point is, the second one might be only there to lead to interviewee into the wrong corner.

Georg 2010-01-06 21:55:21

@KingRadical: picking a random point and getting that line won't give every line equal probability to be picked up. Say, file with two lines, the first line is 10 characters, the second one is 30 characters. The second line has 75% chance of being picked, instead of 50%.

sbk 2010-01-06 21:55:24

Use your heads people - if you have a solution to problem number 3 (read from a stream where you don't know the length ahead of time), then you ALREADY have an excellent solution to the first two cases!

Stephen C. Steel 2010-01-06 21:57:53

In solution 3, how can you choose a random line when you don't know the total number of lines? Otherwise, two passes must be made (at least).

Thomas Matthews 2010-01-06 21:58:30

@Stephen: in line with @gs comments, of course #3 will work.. that's not the point of the three questions... as he said, they are iterative in nature.

Michael Bray 2010-01-06 22:01:04

@Thomas: no you don't need two passes. Basically, at each line you read, you determine the chance that the current line WOULD HAVE been picked if it was the last line in the file. The chances of choosing this line decrease with each new line that you read.

Michael Bray 2010-01-06 22:02:23

@gs: I hope you don't mind, but I changed your comment about "random". It should return a uniform random integer in [0,1), not [0,1]. Otherwise, the first line has a slightly smaller probability of being selected. As Michael said, for 1-line file, there is an infinitesimally small probability of not selecting any line.

Alok 2010-01-06 23:34:17

@Thomas: if the stream only has one line, then that line has a 1 in 1 chance of being picked. If the stream has two lines, then each line has a 1 in 2 chance of being picked. With each new line read, the probability that it will be picked is 1/N, where N is the number of lines read so far.

John Bode 2010-01-06 23:46:22

@Alok: Thanks, I'm always glad when someone fixes something in my answers. I'm still trying to figure out if [0,1) or [0,1] with <= is correct. Does it make difference?

Georg 2010-01-07 08:42:22

@jslap: in practice, it works quite well (code up a prototype to play with and see). After reading N lines, the chance that the selected line is still the first line read is 1 in N-1.

John Bode 2010-01-07 18:03:15

The quality starts to degrade after some time because of inaccuracies of doubles. (Unless you create your own random generator that works with fractions of course.)

Georg 2010-01-07 21:47:11

@jslap: it is not so. I have included a proof in my answer.

Alok 2010-01-12 07:46:23

@gs: Mathematically, I think [0,1) with < or [0,1] with <= are equivalent because of infinite real numbers in [0,1]. In floating-point terms, it *might* make a difference, but I don't think so.

Alok 2010-01-12 07:47:54

Sorry. I agree it is a good solution.

jslap 2010-01-13 14:31:52

Answer 3

A:

#3: write the stream to a file on disk and use solution 2. Not the most efficient solution, of course, but very simple.

Doc Brown 2010-01-06 21:27:29

#4: the stream doesn't fit on disk, (if you prefer: the device doesn't have a writeable filesystem). At least, that's the next thing I'd say in an interview, assuming I was setting this problem in the first place.

Steve Jessop 2010-01-06 23:02:29

Yep, but that was not the OPs question ;-)

Doc Brown 2010-01-07 06:44:51

Answer 4

+6 A:

Alok 2010-01-06 23:07:37

Good. We can also prove it without the induction...+1

jslap 2010-01-13 14:36:59

@jslap: yes. I did it by induction because it was an interesting exercise for me. :-)

Alok 2010-01-13 22:54:02

ansaurus

tags:

views:

answers:

How can I return a random line from a file? Interview Question

related questions