views:

663

answers:

4

I am preparing for a phone interview. I came upon these questions on the internet. Can anyone tell me some good answers for these?

  1. Suppose I give you a text file and ask you a to write a program that will return a random line from the file (all lines must have equal probability to be returned)

  2. Same as part 1, except this time the entire text file cannot fit into main memory

  3. Same as part 2, except now you have a stream instead of a file.

Please help.

Ok...@Everyone, I really had a few ideas in my mintod before asking this...Seeing the relentless attack by my fellow SOers, I am posting my answers. Please feel free to attack them too...

1: Count the number of '\n' in the file. Generate a random number between 1 and the number and return the line after the number-1 '\n'.

2: Bring the file into main memory part by part and follow the above procedure.

3: I dont have much idea about this and would appreciate any inputs.

Its wonderful that you guys really give an inspiration to push forward.....

+1  A: 

Re 1: Use solution to 2

Re 2: You would want to scan the whole file using a RandomAccessFile access to count the number of lines and (possibly) cache the file pointers for each start of line. Then you could choose one at random (I'm assuming this question is not about how to generate random numbers) and move back to that start point, read the line and return it. If you want it fast then make sure you are buffering the reads (raf is v slow otherwise).

Re 3: If the stream doesn't fit in memory (i.e. you cannot cache the whole thing) and you don't know how many lines are in the stream without reading the whole stream (assuming you only get to read it once) then I cannot see a solution. I too wait for answers...

DaveC
you can do it without knowing the number of lines and without reading all the lines in memory. See my answer for details.
Alok
+20  A: 
  1. Read all lines into an array, return a random line in the range of 1 and the amount of lines.

  2. Most simple: Count the lines, choose a line number at random, go through the file a second time and return it.

  3. You just have to remember one line. Each new line has a probability of 1/N (N being lines read).

    Pseudocode:

    i = 1
    chosen_line = ""
    for line in lines:
        if random() < 1/i:    # random returns a uniform random number in [0,1)
            chosen_line = line
        i += 1
    return chosen_line
    

Algorithm number 3 could be used for 1 and 2 too.

Georg
Your solution #3 is correct but perhaps a bit confusing... to clarify, at each line you read, the chance that you should choose the new line will be 1/N where N is the number of lines you've read. Saying to "choose" (1,2,3) for example is unnecessary and (IMO) confusing. Just keep track of which line you chose last, and update percentages as you go. +1.
Michael Bray
@dreamlax: that's the point of my last comment... you ONLY have to keep track of the one line that you've chosen, and each new line that you read will have a 1/N chance of replacing that line. N is the number of lines read "so far", not the total number of lines in the file.
Michael Bray
1: No real sense reading in the whole file into an array if you only want one line.2: Or better yet (though less random, probably): fstat() to get the file size, pick a random point, and read forward/backward from that point until you have a full line of text.
KingRadical
Solution 3 works and is the same that I came up with. It's apparently a well known algorithm too. It does *not* require the entire file in memory, just the most recently selected line. See http://stackoverflow.com/questions/232237/whats-the-best-way-to-return-a-random-line-in-a-text-file-using-c
Mark Ransom
@KingRadical: that depends... on how big the file is, how fast your system is, etc etc... reading the entire file at once probably maximizes the efficiency of disk thruput, if that turns out to be your bottleneck.
Michael Bray
@Michael Bray: Sorry, I deleted my comment, I just realised after I re-read it.
dreamlax
I don't think you can argue much about question 1 and 3, they are quite clear. The real question is if there's a better approach to 2.
Georg
gs: small comment error in your code... random() should NOT return 1 - it must be 0 (inclusive) to 1 (non-inclusive) otherwise there is a chance (however small) that a one-line file wouldn't select any line. either that or change < to <=
Michael Bray
@gs: on whether there is a better approach to #2... I think within the context of the question, the fundamental difference between #2 and #3 is that #3 is implying a forward-only approach, meaning you can't re-read the data, which is why the algorithm provided is necessary. To me, this implies that you should utilize that extra capability in #2. Thus, I think your answer for #2 is about as good as there is.
Michael Bray
The real problem I see with #2, which is a fairly common optimization problem, is that then you are iterating over a data set twice when you really do not need to. The actual time difference between that and an algorithm that only iterates once, or does not iterate over the data set at all may be negligible for small files, but it is inherently less efficient.
KingRadical
To me it looks like those are iterative questions. The first one is super-simple, then the interviewee has to think about the second one, the first thing he'll think of is looping twice over it. The third is deliberately not solvable with the approach used in the second. My point is, the second one might be only there to lead to interviewee into the wrong corner.
Georg
@KingRadical: picking a random point and getting that line won't give every line equal probability to be picked up. Say, file with two lines, the first line is 10 characters, the second one is 30 characters. The second line has 75% chance of being picked, instead of 50%.
sbk
Use your heads people - if you have a solution to problem number 3 (read from a stream where you don't know the length ahead of time), then you ALREADY have an excellent solution to the first two cases!
Stephen C. Steel
In solution 3, how can you choose a random line when you don't know the total number of lines? Otherwise, two passes must be made (at least).
Thomas Matthews
@Stephen: in line with @gs comments, of course #3 will work.. that's not the point of the three questions... as he said, they are iterative in nature.
Michael Bray
@Thomas: no you don't need two passes. Basically, at each line you read, you determine the chance that the current line WOULD HAVE been picked if it was the last line in the file. The chances of choosing this line decrease with each new line that you read.
Michael Bray
@gs: I hope you don't mind, but I changed your comment about "random". It should return a uniform random integer in [0,1), not [0,1]. Otherwise, the first line has a slightly smaller probability of being selected. As Michael said, for 1-line file, there is an infinitesimally small probability of not selecting any line.
Alok
@Thomas: if the stream only has one line, then that line has a 1 in 1 chance of being picked. If the stream has two lines, then each line has a 1 in 2 chance of being picked. With each new line read, the probability that it will be picked is 1/N, where N is the number of lines read so far.
John Bode
@Alok: Thanks, I'm always glad when someone fixes something in my answers. I'm still trying to figure out if [0,1) or [0,1] with <= is correct. Does it make difference?
Georg
@jslap: in practice, it works quite well (code up a prototype to play with and see). After reading N lines, the chance that the selected line is still the first line read is 1 in N-1.
John Bode
The quality starts to degrade after some time because of inaccuracies of doubles. (Unless you create your own random generator that works with fractions of course.)
Georg
@jslap: it is not so. I have included a proof in my answer.
Alok
@gs: Mathematically, I think [0,1) with < or [0,1] with <= are equivalent because of infinite real numbers in [0,1]. In floating-point terms, it *might* make a difference, but I don't think so.
Alok
Sorry. I agree it is a good solution.
jslap
A: 

#3: write the stream to a file on disk and use solution 2. Not the most efficient solution, of course, but very simple.

Doc Brown
#4: the stream doesn't fit on disk, (if you prefer: the device doesn't have a writeable filesystem). At least, that's the next thing I'd say in an interview, assuming I was setting this problem in the first place.
Steve Jessop
Yep, but that was not the OPs question ;-)
Doc Brown
+6  A: 
Alok
Good. We can also prove it without the induction...+1
jslap
@jslap: yes. I did it by induction because it was an interesting exercise for me. :-)
Alok