Hi all,

I have to deal with very large plain text files (over 10 gigabytes; yes, I know it depends on what we call large), with very long lines.

My most recent task involves some line editing based on data from another file.

The data file (which should be modified) contains 1,500,000 lines, each of them around 800 chars long. Each line is unique and contains exactly one identity number, and each identity number is unique.

The modifier file is around 1800 lines long; each line contains an identity number, plus an amount and a date which should be modified in the data file.

I just transformed (with Vim regexes) the modifier file into a sed script, but it's very inefficient.

Let's say I have a line like this in the data file:

(some 500 characters)id_number(some 300 characters)

And I need to modify data in the 300 char part.

Based on the modifier file, I come up with sed lines like this:

/id_number/ s/^\(.\{650\}\).\{20\}/\1CHANGED_AMOUNT_AND_DATA/

So I have 1800 lines like this.

But I know that even on a very fast server, if I do a

sed -i.bak -f modifier.sed data.file

it's very slow, because it has to match every pattern against every line.

Isn't there a better way?

Note: I'm not a programmer and have never learned (in school) about algorithms. I can use awk, sed, and an outdated version of Perl on the server.

Thanks in advance!

+3  A: 

I suggest a program written in Perl (as I am not a sed/awk guru and I don't know exactly what they are capable of).

You "algorithm" is simple: you need to construct, first of all, an hashmap which could give you the new data string to apply for each ID. This is achieved reading the modifier file of course.

Once this hashmap is populated, you can walk through each line of your data file, read the ID in the middle of the line, and generate the new line as you've described above.

I am not a Perl guru either, but I think the program is quite simple. If you need help writing it, ask for it :-)
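
A minimal sketch of that idea (the file names, the tab-separated modifier format, and the 500/10 offset and width of the ID are illustrative assumptions, not details from the question):

use strict;
use warnings;

# Build the hashmap: ID => new 300-char tail (modifier format is assumed: <id><tab><new_tail>).
my %new_tail;
open my $mod, '<', 'modifier.txt' or die "modifier.txt: $!";
while (<$mod>) {
    chomp;
    my ($id, $tail) = split /\t/, $_, 2;
    $new_tail{$id} = $tail;
}
close $mod;

# Single pass over the data file; the ID is assumed to start at column 500 and be 10 chars wide.
open my $in,  '<', 'data.file'     or die "data.file: $!";
open my $out, '>', 'data.file.new' or die "data.file.new: $!";
while (my $line = <$in>) {
    my $id = substr $line, 500, 10;
    if (exists $new_tail{$id}) {
        substr($line, 510) = $new_tail{$id} . "\n";   # rewrite everything after the ID
    }
    print {$out} $line;
}
close $in;
close $out;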

yves Baumes
Sounds like a good solution, provided the ID of a line can be extracted with reasonable effort - which is not clear from the question but a good assumption, imho.
+4  A: 

My suggested approaches (in order of desirability) would be to process this data as:

  1. A database (even a simple SQLite-based DB with an index will perform much better than sed/awk on a 10GB file)
  2. A flat file containing fixed record lengths
  3. A flat file containing variable record lengths

Using a database takes care of all those little details that slow down text-file processing (finding the record you care about, modifying the data, storing it back to the DB). Take a look at DBD::SQLite in the case of Perl.
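
For example, a rough DBD::SQLite sketch (the records table, its columns, and the tab-separated modifier format are made up for illustration; the point is that an indexed id column turns each of the ~1800 modifications into a cheap keyed UPDATE):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=records.db', '', '',
                       { RaiseError => 1, AutoCommit => 0 });

# Illustrative schema: the indexed ID plus the line split into its fixed parts.
# Loading the 1,500,000 records into it (one INSERT per line) is done once and not shown here.
$dbh->do('CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, head TEXT, tail TEXT)');

# Apply the modifications as keyed updates.
my $upd = $dbh->prepare('UPDATE records SET tail = ? WHERE id = ?');
open my $mod, '<', 'modifier.txt' or die "modifier.txt: $!";
while (<$mod>) {
    chomp;
    my ($id, $new_tail) = split /\t/, $_, 2;   # assumed tab-separated modifier format
    $upd->execute($new_tail, $id);
}
close $mod;
$dbh->commit;
$dbh->disconnect;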

If you want to stick with flat files, you'll want to maintain an index manually alongside the big file so you can more easily look up the record numbers you'll need to manipulate. Or, better yet, perhaps your ID numbers are your record numbers?

If you have variable record lengths, I'd suggest converting to fixed record lengths (since it appears only your ID is variable length). If you can't do that, will the existing data ever move around in the file? If not, you can maintain that previously mentioned index and add new entries as necessary, the difference being that instead of the index pointing to a record number, you now point to the absolute position in the file.
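
If the records really are fixed length and each replacement is exactly as long as the text it overwrites, the edit can even be done in place with seek. A hedged sketch (the ID offset/width and the tab-separated modifier format are assumptions; the 650/20 positions come from the sed example in the question):

use strict;
use warnings;

my $id_off   = 500;   # ID starts here (assumption)
my $id_len   = 10;    # ID width (assumption)
my $tail_off = 650;   # offset of the 20-char field to overwrite, as in the sed example

# Pass 1: index ID => absolute byte offset of its record.
my %offset;
open my $fh, '+<', 'data.file' or die "data.file: $!";
while (1) {
    my $pos  = tell $fh;
    my $line = <$fh>;
    last unless defined $line;
    $offset{ substr $line, $id_off, $id_len } = $pos;
}

# Pass 2: overwrite the 20-char field in place for each modification.
open my $mod, '<', 'modifier.txt' or die "modifier.txt: $!";
while (<$mod>) {
    chomp;
    my ($id, $replacement) = split /\t/;   # $replacement must be exactly 20 chars
    next unless exists $offset{$id};
    seek $fh, $offset{$id} + $tail_off, 0 or die "seek: $!";
    print {$fh} $replacement;
}
close $mod;
close $fh;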

MikeyB
I'm going with the DB method. Oracle is available. Currently sqlldr-ing...
Zsolt Botykai
The DB solution (with sqlldr, sqlplus) just finished, while the sed is still running at 7%...
Zsolt Botykai
+2  A: 

With Perl you should use substr to get id_number, especially if id_number has a constant width.

my $id_number = substr($str, 500, $id_number_length);

After that, if $id_number is one of the IDs to modify, you should use substr to replace the remaining text.

substr($str, -300,300, $new_text);

Perl's regular expressions are very fast, but not in this case.
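
Wiring those two calls together into a simple filter might look like this (the 500/10/300 offsets and the %new_text hash, filled from the modifier file, are assumptions for illustration):

my %new_text;   # ID => replacement 300-char tail, loaded from the modifier file (not shown)

while (my $str = <>) {
    chomp $str;
    my $id_number = substr($str, 500, 10);               # assumed 10-char ID at column 500
    if (exists $new_text{$id_number}) {
        substr($str, -300, 300, $new_text{$id_number});  # overwrite the trailing 300 chars
    }
    print "$str\n";
}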

Alexandr Ciornii
A: 

You should almost certainly use a database, as MikeyB suggested.

If you don't want to use a database for some reason, then if the list of modifications will fit in memory (as it currently will at 1800 lines), the most efficient method is a hashtable populated with the modifications as suggested by yves Baumes.

If you get to the point where even the list of modifications becomes huge, you need to sort both files by their IDs and then perform a list merge -- basically:

  1. Compare the ID at the "top" of the input file with the ID at the "top" of the modifications file
  2. Adjust the record accordingly if they match
  3. Write it out
  4. Discard the "top" line from whichever file had the (alphabetically or numerically) lowest ID and read another line from that file
  5. Goto 1.

Behind the scenes, a database will almost certainly use a list merge if you perform this alteration using a single SQL UPDATE command.
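
A hedged sketch of that merge in Perl, assuming both files have already been sorted by ID, the data file's ID sits at a fixed offset, and the modifier file is tab-separated (all illustrative assumptions; the 650/20 positions mirror the sed example in the question):

use strict;
use warnings;

open my $data, '<', 'data.sorted'     or die "data.sorted: $!";
open my $mods, '<', 'modifier.sorted' or die "modifier.sorted: $!";

my $mod = <$mods>;                                 # current "top" of the modifications file

while (my $line = <$data>) {
    my $data_id = substr $line, 500, 10;           # assumed: 10-char ID at column 500

    # discard modifier lines whose ID is lower than the current data ID
    # ('ge'/'eq' assume both files were sorted as strings, i.e. plain sort)
    while (defined $mod) {
        my ($mod_id) = split /\t/, $mod, 2;
        last if $mod_id ge $data_id;
        $mod = <$mods>;
    }

    if (defined $mod) {
        my ($mod_id, $replacement) = split /\t/, $mod, 2;
        if ($mod_id eq $data_id) {                 # IDs match: adjust the record
            chomp $replacement;
            substr $line, 650, 20, $replacement;   # $replacement assumed to be 20 chars
            $mod = <$mods>;
        }
    }
    print $line;                                   # write the (possibly modified) record out
}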

j_random_hacker
A: 

Good deal on the sqlloader or datadump decision. That's the way to go.

hpavc
+1  A: 

My suggestion is: don't use a database. A well-written Perl script will outperform a database by an order of magnitude on this sort of task. Trust me, I have a lot of practical experience with it. You will not even have finished importing the data into the database by the time the Perl script is done.

1,500,000 lines of 800 chars each works out to about 1.2GB. On a very slow disk (30MB/s) you will read it in about 40 seconds; at 50MB/s it takes 24s, at 100MB/s 12s, and so on. But Perl hash lookup (like a DB join) speed on a 2GHz CPU is above 5M lookups/s. That means your CPU-bound work will take seconds and your IO-bound work tens of seconds. If it is really 10GB the numbers change, but the proportion stays the same.
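
Spelling out that back-of-the-envelope estimate with the same figures:

1,500,000 lines x 800 bytes        ~= 1.2 GB of data
1.2 GB at  30 MB/s                 ~= 40 s  (slow disk)
1.2 GB at  50 MB/s                 ~= 24 s
1.2 GB at 100 MB/s                 ~= 12 s
1,500,000 hash lookups at 5M/s     ~= 0.3 s (the hash lookups alone, CPU bound)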

You have not specified whether the modification changes the record size (i.e. whether it can be done in place), so I will not assume it can and will write the script as a filter. You also have not specified the format of your "modifier file" or what sort of modification you want. Let's assume it is tab-separated, something like:

<id><tab><position_after_id><tab><amount><tab><data>

We will read data from stdin and write to stdout, and the script can be something like this:

my $modifier_filename = 'modifier_file.txt';

open my $mf, '<', $modifier_filename or die "Can't open '$modifier_filename': $!";
# read the modifications into a hash keyed by ID
my %modifications;
while (<$mf>) {
  chomp;
  my ($id, $position, $amount, $data) = split /\t/;
  $modifications{$id} = [$position, $amount, $data];
}
close $mf;

# make matching regexp (use quotemeta to prevent regexp meaningful characters)
my $id_regexp = join '|', map quotemeta, keys %modifications;
$id_regexp = qr/($id_regexp)/;     # compile regexp

while (<>) {
  next unless m/$id_regexp/;
  next unless $modifications{$1};
  my ($position, $amount, $data) = @{$modifications{$1}};
  substr $_, $+[1] + $position, $amount, $data;
}
continue { print }

On my laptop it takes about half a minute for 1.5 million rows, 1800 lookup IDs, and 1.2GB of data. For 10GB it should not take more than 5 minutes. Is that reasonably quick for you?

If you find that you are not IO bound (for example if you use some NAS) but CPU bound, you can sacrifice some readability and change it to this:

my $mod;
while (<>) {
  next unless m/$id_regexp/;
  $mod = $modifications{$1};
  next unless $mod;
  substr $_, $+[1] + $mod->[0], $mod->[1], $mod->[2];
}
continue { print }
Hynek -Pichi- Vychodil
Although my task is finished, I will try your solution as well, as Oracle is not always available. Anyway, thanks for your help.
Zsolt Botykai