ansaurus

Question

Estimating the word count of a file without reading the full file

Answer 1

A:

Can't you compensate for the average number of bytes/char with the ratio of chars-read/bytes-read?

Peter Tillemans 2010-08-18 21:27:02

Answer 2

+11 A:

Why not just make the progress bar based on the bytes processed instead of a word count. You know the size upfront, and then the major difficulty is just getting the bytes per word or bytes per line as you process them.

The easiest way to do this is for each line you read in, use getBytes, providing the character encoding that the file was written in, and then get the length of that. This may not be the most efficient way of doing it, but it will be very accurate and simple to do.

Alternatively, you could read in a fixed number of bytes at a time, and then maintain a buffer yourself to handle partial words and line breaks.

Russell Leggett 2010-08-18 21:36:02

Answer 3

+2 A:

In UTF-8, Hindi text averages to about two bytes per char. You seem to read 1000 chars, and apply the calculation to the file length in bytes. So, if you happen to know the language beforehand, you could compensate for the char to byte ratio.

Otherwise, you could determine the byte count of the first 100 chars to estimate the ratio. I do not know Clojure very well, but maybe you can determine the current position in the file as a byte count with some variant of a seek function after having read 1000 chars?

Frank 2010-08-18 21:36:21

Answer 4

A:

How accurate does your progress bar need to be? I'm guessing the answer isn't "mission critical to the 0.1% accurate". In that case, just check the size of the file and it's encoding and have hard-coded AVG_BYTES_PER_WORD to use with your progress bar.

bluedevil2k 2010-08-18 21:58:13

ansaurus

tags:

views:

answers:

Estimating the word count of a file without reading the full file

related questions