Need advice in Efficiency: Scanning 2 very large files worth of information

+2 Q:

Hi,

I have a relatively strange question.

I have a file that is 6 gigabytes in size. What I need to do is scan the entire file, line by line, and find all rows whose ID number matches that of any other row in the file. Essentially, it's like analyzing a web log file in which the many session IDs are ordered by the time of each click rather than by userID.

I tried to do the simple (dumb) thing, which was to create two file readers: one that scans the file line by line to get the userID, and a second that (1) verifies that the userID has not already been processed and (2) if it hasn't, reads every line in the file that begins with that userID and stores some value X related to those rows.

Any advice or tips on how I can make this process work more efficiently?

How much data are you storing about each line, compared with the size of the line? Do you have enough memory to maintain the state for each distinct ID (e.g. number of log lines seen, number of exceptions or whatever)? That's what I'd do if possible.
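To make the "state for each distinct ID" idea concrete, here is a minimal sketch of a single pass that keeps one small value per ID in a HashMap. The class name, the choice of a line count as the state, and the assumption that the ID is the text before the first space are all illustrative; none of them come from the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

class PerIdStats {
    /** Hypothetical per-ID state: just the number of lines seen for that ID. */
    static Map<String, Integer> countLinesPerId(BufferedReader in) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.isEmpty()) continue;
            // Assumption: the ID is everything before the first space.
            int sep = line.indexOf(' ');
            String id = sep < 0 ? line : line.substring(0, sep);
            // One map entry per distinct ID, updated as each line streams past.
            counts.merge(id, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        String log = "u1 /home\nu2 /home\nu1 /cart\n";
        try (BufferedReader in = new BufferedReader(new StringReader(log))) {
            System.out.println(countLinesPerId(in));
        }
    }
}
```

The file is read exactly once, so memory use depends on the number of distinct IDs and the size of the state kept for each, not on the 6 GB of input.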

Otherwise, you'll either need to break the log file into separate chunks (e.g. split it based on the first character of the ID) and then parse each file separately, or perhaps have some way of pretending you have enough memory to maintain the state for each distinct ID: have an in-memory cache which dumps values to disk (or reads them back) only when it has to.
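A rough sketch of the chunking option follows. It swaps in a hash of the ID rather than the first character, purely to spread the IDs more evenly; the class name, the bucket file names, and the space-delimited ID are assumptions for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class LogPartitioner {
    /** Splits input into a fixed number of bucket files; all lines for one ID land in the same bucket. */
    static Path[] partition(Path input, Path outDir, int buckets) throws IOException {
        Path[] paths = new Path[buckets];
        BufferedWriter[] writers = new BufferedWriter[buckets];
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            for (int i = 0; i < buckets; i++) {
                paths[i] = outDir.resolve("bucket-" + i + ".log"); // hypothetical naming scheme
                writers[i] = Files.newBufferedWriter(paths[i], StandardCharsets.UTF_8);
            }
            String line;
            while ((line = in.readLine()) != null) {
                if (line.isEmpty()) continue;
                // Assumption: the ID is everything before the first space.
                int sep = line.indexOf(' ');
                String id = sep < 0 ? line : line.substring(0, sep);
                // floorMod keeps the bucket index non-negative for negative hash codes.
                int b = Math.floorMod(id.hashCode(), buckets);
                writers[b].write(line);
                writers[b].newLine();
            }
        } finally {
            for (BufferedWriter w : writers) {
                if (w != null) w.close();
            }
        }
        return paths;
    }
}
```

Each bucket can then be processed on its own with whatever in-memory approach fits, since no ID spans two buckets. The bucket count would be chosen so that one bucket's worth of state fits in memory.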

Jon Skeet 2010-02-09 14:41:28

+3 A:

Easiest: create a data model, import the file into a database, and take advantage of the power of JDBC and SQL. If necessary (when the file format is pretty specific), you can write some Java that imports the file line by line with the help of BufferedReader#readLine() and PreparedStatement#addBatch().

Hardest: write your Java code so that it doesn't unnecessarily keep large amounts of data in memory. You're then basically reinventing what the average database already does.
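As a hedged illustration of the import loop, the sketch below reads lines with BufferedReader#readLine() and queues them with PreparedStatement#addBatch(). The table log_line, its two columns, the batch size, and the space-delimited line format are all assumptions made up for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class LogImporter {
    // Hypothetical table: log_line(user_id VARCHAR, payload VARCHAR).
    static final String INSERT_SQL =
            "INSERT INTO log_line (user_id, payload) VALUES (?, ?)";

    /** Streams lines into the prepared INSERT, flushing every batchSize rows. Returns rows queued. */
    static long importLines(BufferedReader in, PreparedStatement insert, int batchSize)
            throws IOException, SQLException {
        long rows = 0;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.isEmpty()) continue;
            // Assumption: the ID is everything before the first space.
            int sep = line.indexOf(' ');
            insert.setString(1, sep < 0 ? line : line.substring(0, sep));
            insert.setString(2, sep < 0 ? "" : line.substring(sep + 1));
            insert.addBatch();
            // Send a full batch so the driver never holds the whole file.
            if (++rows % batchSize == 0) insert.executeBatch();
        }
        // Flush whatever is left in the final, partial batch.
        if (rows % batchSize != 0) insert.executeBatch();
        return rows;
    }
}
```

The statement would come from Connection#prepareStatement(LogImporter.INSERT_SQL) on whatever database is in use. After the import, a query along the lines of SELECT user_id, COUNT(*) FROM log_line GROUP BY user_id does the grouping, ideally with an index on user_id.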

BalusC 2010-02-09 14:41:54

+4 A:

1. Import file into SQL database
2. Use SQL
3. Performance!

Seriously, that's it. Databases are optimized exactly for this kind of thing. Alternatively, if you have a machine with enough RAM, just put all the data into a HashMap for easy lookup.

Michael Borgwardt 2010-02-09 14:42:08

+1 A:

For each row R in the file {

Let N be the number that you need to extract from R.
Check if there is a file called N. If not, create it.
Append R to the file called N

}

Bruno Rothgiesser 2010-02-09 14:45:56

Make sure you don't do that on a FAT filesystem, or you're in for a world of pain. Scratch that, you're in for a world of pain anyway because your HD will be thrashing and thus the whole thing will take a long, long time.

Michael Borgwardt 2010-02-09 14:59:58

That's pretty interesting... That would leave me with thousands upon thousands of files - then I suppose I would have to analyze each file separately...

rockit 2010-02-09 15:01:11

You don't mention whether or not this is a regular, ongoing thing or an occasional check.

Have you considered pre-processing the data? Not practical for dynamic data, but if you can sort it based on the field you're interested in, it makes solving the problem much easier. Extracting only the fields you care about may reduce the data volume to a more manageable size as well.
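To show why sorting helps, here is a small sketch of what the analysis could look like once the file is sorted by ID: all lines for one ID are adjacent, so a single pass with constant memory is enough. The class name, the "id=count" summary, and the space-delimited ID are illustrative assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class SortedRuns {
    /** Summarizes each run of consecutive lines sharing an ID as "id=count". Input must be sorted by ID. */
    static List<String> summarize(BufferedReader sorted) throws IOException {
        List<String> out = new ArrayList<>();
        String currentId = null;
        int count = 0;
        String line;
        while ((line = sorted.readLine()) != null) {
            if (line.isEmpty()) continue;
            // Assumption: the ID is everything before the first space.
            int sep = line.indexOf(' ');
            String id = sep < 0 ? line : line.substring(0, sep);
            if (!id.equals(currentId)) {
                // The ID changed, so the previous run is complete.
                if (currentId != null) out.add(currentId + "=" + count);
                currentId = id;
                count = 0;
            }
            count++;
        }
        // Emit the final run, if the input was not empty.
        if (currentId != null) out.add(currentId + "=" + count);
        return out;
    }
}
```

Only the current ID and its running state are held at any time, so the size of the input stops mattering once the sort has been done.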

chris 2010-02-09 14:52:33

A lot of the other advice here is good, but it assumes that you can load what you need into memory without running out. If you can do that, it would be better than the 'worst case' solution I'm describing.

If you have large files, you may end up needing to sort them first. In the past I've dealt with multiple large files that I needed to match up based on a key (sometimes matches were in all files, sometimes only in a couple, etc.). If this is the case, the first thing you need to do is sort your files. Hopefully you're on a box where you can easily do this (for example, there are many good Unix scripts for this). After you've sorted each file, read through the files until you get matching IDs, then process them.

I'd suggest:
1. Open both files and read the first record of each.
2. See if you have matching IDs and process accordingly.
3. Read the next record from whichever file(s) held the key just processed, and repeat step 2 until both files reach EOF.

For example, if you had keys 1,2,5,8 in FILE1 and 2,3,5,9 in FILE2 you'd:
1. Open and read both files (FILE1 has ID 1, FILE2 has ID 2).
2. Process 1.
3. Read FILE1 (FILE1 has ID 2)
4. Process 2.
5. Read FILE1 (ID 5) and FILE2 (ID 3)
6. Process 3.
7. Read FILE2 (ID 5)
8. Process 5.
9. Read FILE1 (ID 8) and FILE2 (ID 9).
10. Process 8.
11. Read FILE1 (EOF....no more FILE1 processing).
12. Process 9.
13. Read FILE2 (EOF....no more FILE2 processing).

Make sense?
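The walkthrough above can be sketched in Java roughly as follows. It assumes each file is already sorted and holds one integer key per line with no repeated keys inside a file, as in the example; the class name and the "processed" labels are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class SortedMerge {
    /** Reads the next key, or returns null at EOF. Assumes one integer key per line. */
    static Integer next(BufferedReader in) throws IOException {
        String line = in.readLine();
        return line == null ? null : Integer.valueOf(line.trim());
    }

    /** Walks two sorted inputs in step, always processing the smallest pending key. */
    static List<String> merge(BufferedReader f1, BufferedReader f2) throws IOException {
        List<String> processed = new ArrayList<>();
        Integer a = next(f1);
        Integer b = next(f2);
        while (a != null || b != null) {
            if (b == null || (a != null && a < b)) {
                processed.add(a + " (FILE1 only)");
                a = next(f1); // only FILE1 held this key, so only FILE1 advances
            } else if (a == null || b < a) {
                processed.add(b + " (FILE2 only)");
                b = next(f2); // only FILE2 held this key
            } else {
                processed.add(a + " (both)");
                a = next(f1); // matching key: advance both files
                b = next(f2);
            }
        }
        return processed;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader f1 = new BufferedReader(new StringReader("1\n2\n5\n8\n"));
        BufferedReader f2 = new BufferedReader(new StringReader("2\n3\n5\n9\n"));
        merge(f1, f2).forEach(System.out::println);
    }
}
```

With the sample keys, this processes 1, 2, 3, 5, 8, 9 in that order, matching the steps listed above. Real log data with many lines per ID would need each branch to consume the whole run of lines for the current key rather than a single record.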

SOA Nerd 2010-02-09 15:14:19
