Hello

I've got a text file that contains 1 000 002 numbers in the following format:

123 456
1 2 3 4 5 6 .... 999999 100000

Now I need to read that data and assign the first two numbers to int variables and all the rest (1 000 000 numbers) to an int[] array.

It's not a hard task, but it's horribly slow.

My first attempt was java.util.Scanner:

 Scanner stdin = new Scanner(new File("./path"));
 int n = stdin.nextInt();
 int t = stdin.nextInt();
 int array[] = new int[n];

 for (int i = 0; i < n; i++) {
     array[i] = stdin.nextInt();
 }

It works as expected, but it takes about 7500 ms to execute. I need to load that data within a few hundred milliseconds.

Then I tried java.io.BufferedReader:

Using BufferedReader.readLine() and String.split() I got the same results in about 1700 ms, but that's still too slow.

How can I read that amount of data in less than 1 second? The final result should be equal to:
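For reference, the readLine()/split() approach can be sketched roughly as follows. This is a minimal illustration, assuming the two-line format shown above with single-space separators; the class name SplitReaderSketch and the Data holder are made up for this example and are not the code that was actually timed.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

final class SplitReaderSketch {

    /** Illustrative holder for the two header values and the data array. */
    static final class Data {
        final int n;
        final int t;
        final int[] array;

        Data(int n, int t, int[] array) {
            this.n = n;
            this.t = t;
            this.array = array;
        }
    }

    static Data read(String path) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            // First line: "n t"
            String[] header = in.readLine().trim().split(" ");
            int n = Integer.parseInt(header[0]);
            int t = Integer.parseInt(header[1]);

            // Second line: n integers separated by single spaces.
            // split() materializes every token as a String, which is
            // a plausible reason this approach stays slow.
            String[] tokens = in.readLine().trim().split(" ");
            int[] array = new int[n];
            for (int i = 0; i < n; i++) {
                array[i] = Integer.parseInt(tokens[i]);
            }
            return new Data(n, t, array);
        }
    }
}
```

Calling SplitReaderSketch.read("./path") would return the header values and the array in one object.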

int n = 123;
int t = 456;
int array[] = { 1, 2, 3, 4, ..., 999999, 100000 };

Update, following trashgod's answer:

The StreamTokenizer solution is faster (it takes about 1400 ms), but that's still too slow:

StreamTokenizer st = new StreamTokenizer(new FileReader("./test_grz"));
st.nextToken();
int n = (int) st.nval;

st.nextToken();
int t = (int) st.nval;

int array[] = new int[n];

for (int i = 0; st.nextToken() != StreamTokenizer.TT_EOF; i++) {
    array[i] = (int) st.nval;
}

PS. There is no need for validation. I'm 100% sure that the data in the ./test_grz file is correct.

+1  A: 

How much memory do you have in the computer? You could be running into GC issues.

The best thing to do is to process the data one line at a time if possible. Don't load it into an array. Load what you need, process, write it out, and continue.

This will reduce your memory footprint while still using the same amount of file I/O.

Pyrolistical
It looks like his second line is one looong line that contains a million numbers..
SB
If my calculations are correct, 1 million `int` values cost me only 7 MB of memory, which is not that much. I just need to load that data from the file into memory; I'll need it for some calculations that require the whole data set to be loaded.
Crozin
+1  A: 

StreamTokenizer may be faster, as suggested here.

trashgod
In fact, StreamTokenizer seems to be the fastest solution so far (please check my question update). But it still needs about 1400 ms to read the necessary data.
Crozin
thanks TG, StreamTokenizer is very nice.
KevinDTimm
Excellent. See also @Kevin Brock's informative answer: http://stackoverflow.com/questions/2693223/read-large-amount-of-data-from-file-in-java/2694507#2694507
trashgod
+1  A: 

If it's possible to reformat the input so that each integer is on a separate line (instead of one long line with one million integers), you should see much better performance using Integer.parseInt(BufferedReader.readLine()), thanks to smarter buffering by line and not having to split the long string into a separate array of Strings.

Edit: I tested this and managed to read the output produced by seq 1 1000000 into an array of int in well under half a second, but of course this depends on the machine.
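A minimal sketch of the one-integer-per-line idea, assuming the file holds at least `count` lines with exactly one non-negative or negative decimal integer each and no surrounding whitespace. The class name LinePerIntSketch is illustrative, and this is not the exact code used for the timing above.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

final class LinePerIntSketch {

    // Reads exactly `count` integers, one per line, from the file at `path`.
    // Assumes the input is valid; a short file makes parseInt throw.
    static int[] read(String path, int count) throws IOException {
        int[] array = new int[count];
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            for (int i = 0; i < count; i++) {
                array[i] = Integer.parseInt(in.readLine());
            }
        }
        return array;
    }
}
```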

Arkku
Unfortunately, I cannot change the file format. It has to be two integers separated by a single space on the first line and 1 million integers on the second line (also separated by single spaces).
Crozin
A: 

Defrag your drive. Close all other applications. Use hdparm to optimize the drive. Try Java's NIO package.
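One way the NIO suggestion might look is to memory-map the file and parse digits straight from the buffer. This is only a sketch under the question's assumptions (ASCII digits, non-negative values, valid input); the class and method names are invented for illustration, and whether it beats a buffered stream would need measuring.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

final class NioReadSketch {

    // Parses the next non-negative decimal integer from the buffer,
    // skipping any non-digit bytes in front of it. Returns -1 if the
    // buffer ends before a digit is found.
    static int nextInt(ByteBuffer buf) {
        int result = 0;
        boolean seenDigit = false;
        while (buf.hasRemaining()) {
            byte b = buf.get();
            if (b >= '0' && b <= '9') {
                seenDigit = true;
                result = result * 10 + (b - '0');
            } else if (seenDigit) {
                break; // first delimiter after the number ends it
            }
        }
        return seenDigit ? result : -1;
    }

    // Maps the whole file read-only and returns the n data values.
    static int[] read(String path) throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get(path), StandardOpenOption.READ)) {
            ByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            int n = nextInt(buf);
            int t = nextInt(buf); // parsed to advance past it; unused in this sketch
            int[] array = new int[n];
            for (int i = 0; i < n; i++) {
                array[i] = nextInt(buf);
            }
            return array;
        }
    }
}
```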

Dave Jarvis
A: 

I would extend FilterReader and parse the string as it is read in the read() method. Have a getNextNumber method return the numbers. Code left as an exercise for the reader.
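A rough sketch of what that could look like, as one possible variation: here the parsing lives in getNextNumber(), which pulls characters through the inherited read(), rather than inside an overridden read(). It assumes valid input containing only non-negative integers and delimiters, and the class name NumberReader is made up. Wrap a BufferedReader, since it reads one character at a time.

```java
import java.io.EOFException;
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

final class NumberReader extends FilterReader {

    NumberReader(Reader in) {
        super(in);
    }

    // Returns the next non-negative integer, skipping any non-digit
    // characters before it. Throws EOFException when no number is left.
    int getNextNumber() throws IOException {
        int c;
        // Skip everything up to the first digit.
        do {
            c = read();
            if (c == -1) {
                throw new EOFException("no more numbers");
            }
        } while (c < '0' || c > '9');

        // Accumulate digits until a non-digit or end of stream.
        // The delimiter that ends the number is consumed.
        int result = 0;
        while (c >= '0' && c <= '9') {
            result = result * 10 + (c - '0');
            c = read();
        }
        return result;
    }
}
```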

Skip Head
+1  A: 

You can reduce the time for the StreamTokenizer result by using a BufferedReader:

Reader r = null;
try {
    r = new BufferedReader(new FileReader(file));
    final StreamTokenizer st = new StreamTokenizer(r);
    ...
} finally {
    if (r != null)
        r.close();
}

Also, don't forget to close your files, as I've shown here.

You can also shave some more time off by using a custom tokenizer just for your purposes:

import java.io.EOFException;
import java.io.IOException;
import java.io.Reader;

public class CustomTokenizer {

    private final Reader r;

    public CustomTokenizer(final Reader r) {
        this.r = r;
    }

    public int nextInt() throws IOException {
        int i = r.read();
        if (i == -1)
            throw new EOFException();

        char c = (char) i;

        // Skip any whitespace
        while (c == ' ' || c == '\n' || c == '\r') {
            i = r.read();
            if (i == -1)
                throw new EOFException();
            c = (char) i;
        }

        int result = (c - '0');
        while ((i = r.read()) >= 0) {
            c = (char) i;
            if (c == ' ' || c == '\n' || c == '\r')
                break;
            result = result * 10 + (c - '0');
        }

        return result;
    }

}

Remember to use a BufferedReader for this. This custom tokenizer assumes the input data is always completely valid and contains only spaces, new lines, and digits.

If you read this file often and its contents rarely change, you should probably cache the array and keep track of the file's last-modified time. Then, if the file has not changed since the last read, just return the cached copy of the array, which skips the parsing entirely. For example:

public class ArrayRetriever {

    private File inputFile;
    private long lastModified;
    private int[] lastResult;

    public ArrayRetriever(File file) {
        this.inputFile = file;
    }

    public int[] getResult() {
        if (lastResult != null && inputFile.lastModified() == lastModified)
            return lastResult;

        lastModified = inputFile.lastModified();

        // do logic to actually read the file here

        lastResult = array; // the array variable from your examples
        return lastResult;
    }

}
Kevin Brock
Thanks for the answer - I'll check it tomorrow - I hope that this is what I am looking for.
Crozin
+1 It might be worth specifying the buffer size when constructing the `BufferedReader`, too.
trashgod
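To illustrate the buffer-size comment above, here is a hedged sketch that passes an explicit size to the BufferedReader constructor before handing it to StreamTokenizer. The class name and the idea of taking the size as a parameter are illustrative assumptions; the best value depends on the machine and should be measured.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;

final class BufferedTokenizerSketch {

    // Reads "n t" followed by n integers, using a BufferedReader with an
    // explicit buffer size (in chars). Assumes valid, non-negative input.
    static int[] read(String path, int bufferSize) throws IOException {
        try (Reader r = new BufferedReader(new FileReader(path), bufferSize)) {
            StreamTokenizer st = new StreamTokenizer(r);

            st.nextToken();
            int n = (int) st.nval;

            st.nextToken();
            int t = (int) st.nval; // parsed to advance past it; unused here

            int[] array = new int[n];
            // Stop at n values or at end of file, whichever comes first.
            for (int i = 0; i < n && st.nextToken() != StreamTokenizer.TT_EOF; i++) {
                array[i] = (int) st.nval;
            }
            return array;
        }
    }
}
```

For example, BufferedTokenizerSketch.read("./test_grz", 1 << 16) would use a 64 Ki-char buffer; that figure is an arbitrary starting point, not a recommendation.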
+1  A: 

Thanks for all the answers, but I've already found a method that meets my criteria:

BufferedInputStream bis = new BufferedInputStream(new FileInputStream("./path"));
int n = readInt(bis);
int t = readInt(bis);
int array[] = new int[n];
for (int i = 0; i < n; i++) {
    array[i] = readInt(bis);
}

private static int readInt(InputStream in) throws IOException {
    int ret = 0;
    boolean dig = false;

    for (int c = 0; (c = in.read()) != -1; ) {
        if (c >= '0' && c <= '9') {
            dig = true;
            ret = ret * 10 + c - '0';
        } else if (dig) break;
    }

    return ret;
}

It takes only about 300 ms to read 1 million integers!

Crozin