views:

470

answers:

3

What would be the best approach to reverse a large text file that is uploaded asynchronously to a servlet that reverses this file in a scalable and efficient way?

  • text file can be massive (gigabytes long)
  • can assume mulitple server/clustered environment to do this in a distributed manner.
  • open source libraries are encouraged to consider

I was thinking of using Java NIO to treat file as an array on disk (so that I don't have to treat the file as a string buffer in memory). Also, I am thinking of using MapReduce to break up the file and process it in separate machines.

Any input is appreciated. Thanks.

  • Daniel
+2  A: 

If it is uploaded to you and you can get the length at the beginning, you could just create an empty full-sized file up front and write to it starting from the back and working your way to the front using seek

You'd probably want to define a block size (like 1K?) and reverse that much in memory before writing it out to the file.

Bill K
+1  A: 

That's a pretty tough task. If you can ensure that the HTTP Content-Length and Content-Type headers are present in the upload request (or in the multipart body when it's a multipart/form-data request), then it would be an easy job with help of RandomAccessFile. The content length is mandatory so that the RandomAccessFile knows how long the file will be and write the character at the position you want it to be. The character encoding (which is usually present as an attribute of the content type header) is mandatory to know how many bytes a character will take into account (because RandomAccessFile is byte based and for example UTF-8 encoding is variable-byte-length).

Here's a kickoff example (leaving obvious exception handling aside):

package com.stackoverflow.q2725897;

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;
import java.io.Reader;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class Test {

    public static void main(String... args) throws Exception {

        // Stub input. You need to gather it yourself from your sources.
        File file = new File("/file.txt");
        long length = file.length(); // Get it from HTTP request header using file upload API in question (Commons FileUpload?).
        String encoding = "UTF-8"; // Get it from HTTP request header using file upload API in question (Commons FileUpload?).
        InputStream content = new FileInputStream(file); // Get it from HTTP request body using file upload API in question (Commons FileUpload?).

        // Now the real job.
        Reader input = new InputStreamReader(content, encoding);
        RandomAccessFile output = new RandomAccessFile(new File("/filereversed.txt"), "rwd");
        CharsetEncoder encoder = Charset.forName(encoding).newEncoder();

        for (int data; (data = input.read()) != -1;) {
            ByteBuffer bytes = encoder.encode(CharBuffer.wrap(new char[] { (char) data }));
            length -= bytes.limit();
            output.seek(length);
            output.write(bytes.array());
        }

        // Should actually be done in finally.
        input.close();
        output.close();
    }

}

If those headers are not present (especially Content-length is important), then you'll really need to store it on disk first until end of stream and then re-read and reverse it the same way with help of RandomAccessFile.

Update: it would actually be tougher than it look like. Is the character encoding of the input always guaranteed to be the same? If so, what one would it be? Also, what would you like to do with for example surrogate characters and newlines? The above example doesn't take that into account correctly. But it at least gives the base idea.

BalusC
Thanks BalusC for your reply. I am assuming unicode/UTF-8 encoding. I don't think we need to do anything special with newlines.
DanJanson
You can convert a single char to a String using Character.toString((char) data)
Kathy Van Stone
When you get the byte array to calculcate its length, you should keep it around and write that to output. Currently you are writing the code point (effectively calling output.write(int b)), not the encoded bytes.
Kathy Van Stone
Also if Transfer-Encoding is set to "chunked" you will not have a Content-Length.
Kathy Van Stone
@Kathy: 1) That's not more efficient. By another topic I learnt that `CharBuffer#wrap()` is better. 2) No, I've casted it to char. 3) That depends on the client side. I already stated that the content length is mandatory.
BalusC
@BalusC I don't see a write(char c) method in RandomAccessFile. And if you are not specifying the encoding you are writing the code point not the encoded value.
Kathy Van Stone
@BalusC http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.13 says that Content-Length is optional. Where do you see that it is required?
Kathy Van Stone
I suppose it might be mandatory on uploading as opposed to downloading
Kathy Van Stone
@Kathy: Uh, please use the edit link to edit comments :) 1) Yes, I see, thank you for correcting this. 2) Mandatory in the server side, because you otherwise have to store the entire file first. Also see the text I wrote after the code. 3) Yeah... This smells too much nitpicking. I'll leave it as it is :)
BalusC
@BalusC which smells too much like nitpicking?
Kathy Van Stone
A: 

Save it in manageable chunks to disk as they come in, and then read the chunks backward when needed and present the content backwards.

Would 1 Mb be a reasonable size, given the amount available to a normal java application these days?

Thorbjørn Ravn Andersen