Hi all, I want to read a very big text file (the log file of a web app) and do some processing.
Is there any framework to help with this kind of work?
The file is 100 MB+; should I use multiple threads?
Best regards
If the file is very big and you want to process it as a whole (not just grep it or do line-wise processing), there's a risk you'll run out of RAM (or at least clutter your memory).
A more robust solution is to parse the file line by line, store the data in an on-disk random-access store (a database), and then run the processing against that store.
This slows the processing down, since you go through the disk, but it keeps your memory usage bounded regardless of the file size.
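As a minimal sketch of that approach (assuming the embedded H2 database on the classpath, but any JDBC database would do; the file name is made up):

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class LogToDb {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection("jdbc:h2:./logdb");
        conn.createStatement().execute(
                "CREATE TABLE IF NOT EXISTS log_lines(line VARCHAR(4000))");
        PreparedStatement insert =
                conn.prepareStatement("INSERT INTO log_lines(line) VALUES (?)");
        BufferedReader in = new BufferedReader(new FileReader("webapp.log")); // hypothetical file name
        String line;
        int count = 0;
        while ((line = in.readLine()) != null) {
            insert.setString(1, line);
            insert.addBatch();
            if (++count % 1000 == 0) {
                insert.executeBatch(); // flush in batches so memory stays flat
            }
        }
        insert.executeBatch();
        in.close();
        conn.close();
        // now query the table instead of holding the whole file in memory
    }
}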
Depending on your needs, the most efficient solution may be to launch an external program designed for this kind of work, such as perl, grep, or awk, tell it what to do, and then postprocess its output.
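A rough sketch of that in Java, using ProcessBuilder to run grep and read its output (the pattern and file name are made up for illustration):

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class GrepRunner {
    public static void main(String[] args) throws Exception {
        // let grep do the heavy scanning, then postprocess its output in Java
        Process p = new ProcessBuilder("grep", "ERROR", "webapp.log").start();
        BufferedReader out = new BufferedReader(
                new InputStreamReader(p.getInputStream()));
        String line;
        while ((line = out.readLine()) != null) {
            // postprocess each matching line here
            System.out.println(line);
        }
        p.waitFor();
    }
}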
In your case multithreading will not help much, as the problem is I/O-bound rather than CPU-bound (well, unless you are doing a lot of processing of the text in memory and then writing it back). If the concern is reading the file, 100 MB is generally something a large system can handle. If the file really is that size and you are running on a Unix machine, see if you can run your code under a 64-bit JVM; of course, this is not really a permanent solution.
A scalable solution is to read the file line by line, keep only the data that you want, and finally work on that data alone (assuming you can do offline processing). The approach suggested by Little Bobby Tables is a good one, since it keeps your memory footprint constant (the running time will be O(n), where n is the number of lines to process).
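Something along these lines, assuming purely for illustration that you only care about per-path hit counts and that the request path is the seventh space-separated field of each line (as in the common log format):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

public class LineByLine {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new FileReader("webapp.log"));
        Map<String, Integer> hitsPerPath = new HashMap<String, Integer>();
        String line;
        while ((line = in.readLine()) != null) {
            // keep only the data we care about; the rest of the line is discarded
            String[] fields = line.split(" ");
            if (fields.length > 6) {
                String path = fields[6];
                Integer n = hitsPerPath.get(path);
                hitsPerPath.put(path, n == null ? 1 : n + 1);
            }
        }
        in.close();
        // memory use is bounded by the number of distinct paths, not the file size
        System.out.println(hitsPerPath);
    }
}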
Hadoop is great for this: http://hadoop.apache.org/. It handles the threading, distributes the work to different machines, and has a lot of functionality around text input. The map-reduce paradigm is a bit different, but definitely consider it.
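To give a feel for the API, here is a minimal mapper sketch (the ERROR filter, class name, and output key are just illustrative assumptions; a complete job would also need a reducer and a driver class):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ErrorLineMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text keyOut = new Text("ERROR");

    @Override
    protected void map(LongWritable offset, Text value, Context context)
            throws IOException, InterruptedException {
        // TextInputFormat hands the mapper one log line per call
        if (value.toString().contains("ERROR")) {
            context.write(keyOut, ONE);
        }
    }
}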
I recently wrote a log analyzer for 300 MB+ log files. I used the Apache Commons IO LineIterator class, which performed fine (around 20 seconds).
To save I/O, you don't need to unzip the file first; instead use
new InputStreamReader(new GZIPInputStream(new FileInputStream(logFile)), "US-ASCII");
as the input reader.
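Put together, a minimal version of that (with a made-up file name) could look like:

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.zip.GZIPInputStream;
import org.apache.commons.io.LineIterator;

public class GzipLogScan {
    public static void main(String[] args) throws Exception {
        // stream the gzipped log directly instead of unzipping it to disk first
        Reader reader = new InputStreamReader(
                new GZIPInputStream(new FileInputStream("webapp.log.gz")), "US-ASCII");
        LineIterator it = new LineIterator(reader);
        try {
            while (it.hasNext()) {
                String line = it.nextLine();
                // process each line here
            }
        } finally {
            LineIterator.closeQuietly(it);
        }
    }
}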