Hi all,

I have to deal with very large plain text files (over 10 gigabytes; yes, I know it depends on what we call large), with very long lines.

My most recent task involves some line editing based on data from another file.

The data file (which should be modified) contains 1,500,000 lines, each of them around 800 chars long. Each line is unique and contains exactly one identity number, and each identity number is unique.

The modifier file is around 1800 lines long; each line contains an identity number, plus an amount and a date which should be modified in the data file.

I just transformed (with Vim regexes) the modifier file into a sed script, but it's very inefficient.

Let's say I have a line like this in the data file:

(some 500 characters)id_number(some 300 characters)

And I need to modify data in the 300 char part.

Based on the modifier file, I come up with sed lines like this:

/id_number/ s/^\(.\{650\}\).\{20\}/\1CHANGED_AMOUNT_AND_DATA/

So I have 1800 lines like this.

But I know that even on a very fast server, if I do a

sed -i.bak -f modifier.sed data.file

it's very slow, because it has to match every pattern against every line.

Isn't there a better way?

Note: I'm not a programmer and have never learned (in school) about algorithms. I can use awk, sed, and an outdated version of Perl on the server.

Thanks in advance!

+3  A: 

I suggest a program written in Perl (as I am not a sed/awk guru and I don't know exactly what they are capable of).

You "algorithm" is simple: you need to construct, first of all, an hashmap which could give you the new data string to apply for each ID. This is achieved reading the modifier file of course.

Once this hashmap is populated, you can walk through each line of your data file, read the ID in the middle of the line, and generate the new line as you've described above.

I am not a Perl guru either, but I think the program is quite simple. If you need help writing it, ask for it :-)
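
A minimal sketch of that idea (the file names, the tab-separated modifier format, and the 500/10 offset and width of the ID are illustrative assumptions, not details from the question):

use strict;
use warnings;

# Build the hashmap: ID => new 300-char tail (modifier format is assumed: <id><tab><new_tail>).
my %new_tail;
open my $mod, '<', 'modifier.txt' or die "modifier.txt: $!";
while (<$mod>) {
    chomp;
    my ($id, $tail) = split /\t/, $_, 2;
    $new_tail{$id} = $tail;
}
close $mod;

# Single pass over the data file; the ID is assumed to start at column 500 and be 10 chars wide.
open my $in,  '<', 'data.file'     or die "data.file: $!";
open my $out, '>', 'data.file.new' or die "data.file.new: $!";
while (my $line = <$in>) {
    my $id = substr $line, 500, 10;
    if (exists $new_tail{$id}) {
        substr($line, 510) = $new_tail{$id} . "\n";   # rewrite everything after the ID
    }
    print {$out} $line;
}
close $in;
close $out;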

yves Baumes
Sounds like a good solution, provided the ID of a line can be extracted with reasonable effort - which is not clear from the question but a good assumption, imho.
+4  A: 

My suggested approaches (in order of desirability) would be to process this data as:

  1. A database (even a simple SQLite-based DB with an index will perform much better than sed/awk on a 10GB file)
  2. A flat file containing fixed record lengths
  3. A flat file containing variable record lengths

Using a database takes care of all those little details that slow down text-file processing (finding the record you care about, modifying the data, storing it back to the DB). Take a look at DBD::SQLite in the case of Perl.
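
For example, a rough DBD::SQLite sketch (the records table, its columns, and the tab-separated modifier format are made up for illustration; the point is that an indexed id column turns each of the ~1800 modifications into a cheap keyed UPDATE):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=records.db', '', '',
                       { RaiseError => 1, AutoCommit => 0 });

# Illustrative schema: the indexed ID plus the line split into its fixed parts.
# Loading the 1,500,000 records into it (one INSERT per line) is done once and not shown here.
$dbh->do('CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, head TEXT, tail TEXT)');

# Apply the modifications as keyed updates.
my $upd = $dbh->prepare('UPDATE records SET tail = ? WHERE id = ?');
open my $mod, '<', 'modifier.txt' or die "modifier.txt: $!";
while (<$mod>) {
    chomp;
    my ($id, $new_tail) = split /\t/, $_, 2;   # assumed tab-separated modifier format
    $upd->execute($new_tail, $id);
}
close $mod;
$dbh->commit;
$dbh->disconnect;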

If you want to stick with flat files, you'll want to maintain an index manually alongside the big file so you can more easily look up the record numbers you'll need to manipulate. Or, better yet, perhaps your ID numbers are your record numbers?

If you have variable record lengths, I'd suggest converting to fixed record lengths (since it appears only your ID is variable length). If you can't do that, will the existing data ever move around in the file? If not, you can maintain that previously mentioned index and add new entries as necessary, the difference being that instead of the index pointing to a record number, you now point to the absolute position in the file.
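
If the records really are fixed length and each replacement is exactly as long as the text it overwrites, the edit can even be done in place with seek. A hedged sketch (the ID offset/width and the tab-separated modifier format are assumptions; the 650/20 positions come from the sed example in the question):

use strict;
use warnings;

my $id_off   = 500;   # ID starts here (assumption)
my $id_len   = 10;    # ID width (assumption)
my $tail_off = 650;   # offset of the 20-char field to overwrite, as in the sed example

# Pass 1: index ID => absolute byte offset of its record.
my %offset;
open my $fh, '+<', 'data.file' or die "data.file: $!";
while (1) {
    my $pos  = tell $fh;
    my $line = <$fh>;
    last unless defined $line;
    $offset{ substr $line, $id_off, $id_len } = $pos;
}

# Pass 2: overwrite the 20-char field in place for each modification.
open my $mod, '<', 'modifier.txt' or die "modifier.txt: $!";
while (<$mod>) {
    chomp;
    my ($id, $replacement) = split /\t/;   # $replacement must be exactly 20 chars
    next unless exists $offset{$id};
    seek $fh, $offset{$id} + $tail_off, 0 or die "seek: $!";
    print {$fh} $replacement;
}
close $mod;
close $fh;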

MikeyB
I'm going with the DB method. Oracle is available. Currently sqlldr-ing...
Zsolt Botykai
The DB solution (with sqlldr, sqlplus) just finished, while the sed is still running at 7%...
Zsolt Botykai
+2  A: 

With Perl you should use substr to get id_number, especially if id_number has a constant width.

my $id_number = substr($str, 500, $id_number_length);

After that, if $id_number is one of the IDs to modify, you should use substr to replace the remaining text.

substr($str, -300,300, $new_text);

Perl's regular expressions are very fast, but not in this case.
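
Wiring those two calls together into a simple filter might look like this (the 500/10/300 offsets and the %new_text hash, filled from the modifier file, are assumptions for illustration):

my %new_text;   # ID => replacement 300-char tail, loaded from the modifier file (not shown)

while (my $str = <>) {
    chomp $str;
    my $id_number = substr($str, 500, 10);               # assumed 10-char ID at column 500
    if (exists $new_text{$id_number}) {
        substr($str, -300, 300, $new_text{$id_number});  # overwrite the trailing 300 chars
    }
    print "$str\n";
}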

Alexandr Ciornii
A: 

You should almost certainly use a database, as MikeyB suggested.

If you don't want to use a database for some reason, then if the list of modifications will fit in memory (as it currently will at 1800 lines), the most efficient method is a hashtable populated with the modifications as suggested by yves Baumes.

If you get to the point where even the list of modifications becomes huge, you need to sort both files by their IDs and then perform a list merge -- basically:

  1. Compare the ID at the "top" of the input file with the ID at the "top" of the modifications file
  2. Adjust the record accordingly if they match
  3. Write it out
  4. Discard the "top" line from whichever file had the (alphabetically or numerically) lowest ID and read another line from that file
  5. Goto 1.

Behind the scenes, a database will almost certainly use a list merge if you perform this alteration using a single SQL UPDATE command.
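
A hedged sketch of that merge in Perl, assuming both files have already been sorted by ID, the data file's ID sits at a fixed offset, and the modifier file is tab-separated (all illustrative assumptions; the 650/20 positions mirror the sed example in the question):

use strict;
use warnings;

open my $data, '<', 'data.sorted'     or die "data.sorted: $!";
open my $mods, '<', 'modifier.sorted' or die "modifier.sorted: $!";

my $mod = <$mods>;                                 # current "top" of the modifications file

while (my $line = <$data>) {
    my $data_id = substr $line, 500, 10;           # assumed: 10-char ID at column 500

    # discard modifier lines whose ID is lower than the current data ID
    # ('ge'/'eq' assume both files were sorted as strings, i.e. plain sort)
    while (defined $mod) {
        my ($mod_id) = split /\t/, $mod, 2;
        last if $mod_id ge $data_id;
        $mod = <$mods>;
    }

    if (defined $mod) {
        my ($mod_id, $replacement) = split /\t/, $mod, 2;
        if ($mod_id eq $data_id) {                 # IDs match: adjust the record
            chomp $replacement;
            substr $line, 650, 20, $replacement;   # $replacement assumed to be 20 chars
            $mod = <$mods>;
        }
    }
    print $line;                                   # write the (possibly modified) record out
}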

j_random_hacker
A: 

Good deal on the sqlloader or datadump decision. That's the way to go.

hpavc
+1  A: 

My suggestion is: don't use a database. A well-written Perl script will outperform a database by an order of magnitude on this sort of task. Trust me, I have a lot of practical experience with it. You will not even have finished importing the data into the database by the time the Perl script is done.

1,500,000 lines of 800 chars each works out to about 1.2GB. On a very slow disk (30MB/s) you will read it in about 40 seconds; at 50MB/s it takes 24s, at 100MB/s 12s, and so on. But Perl hash lookup (like a DB join) speed on a 2GHz CPU is above 5M lookups/s. That means your CPU-bound work will take seconds and your IO-bound work tens of seconds. If it is really 10GB the numbers change, but the proportion stays the same.
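
Spelling out that back-of-the-envelope estimate with the same figures:

1,500,000 lines x 800 bytes        ~= 1.2 GB of data
1.2 GB at  30 MB/s                 ~= 40 s  (slow disk)
1.2 GB at  50 MB/s                 ~= 24 s
1.2 GB at 100 MB/s                 ~= 12 s
1,500,000 hash lookups at 5M/s     ~= 0.3 s (the hash lookups alone, CPU bound)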

You have not specified whether the modification changes the record size (i.e. whether it can be done in place), so I will not assume it can and will write the script as a filter. You also have not specified the format of your "modifier file" or what sort of modification you want. Let's assume it is tab-separated, something like:

<id><tab><position_after_id><tab><amount><tab><data>

We will read data from stdin and write to stdout, and the script can be something like this:

my $modifier_filename = 'modifier_file.txt';

open my $mf, '<', $modifier_filename or die "Can't open '$modifier_filename': $!";
# read the modifications into a hash keyed by ID
my %modifications;
while (<$mf>) {
  chomp;
  my ($id, $position, $amount, $data) = split /\t/;
  $modifications{$id} = [$position, $amount, $data];
}
close $mf;

# make matching regexp (use quotemeta to prevent regexp meaningful characters)
my $id_regexp = join '|', map quotemeta, keys %modifications;
$id_regexp = qr/($id_regexp)/;     # compile regexp

while (<>) {
  next unless m/$id_regexp/;
  next unless $modifications{$1};
  my ($position, $amount, $data) = @{$modifications{$1}};
  substr $_, $+[1] + $position, $amount, $data;
}
continue { print }

On my laptop it takes about half a minute for 1.5 million rows, 1800 lookup IDs, and 1.2GB of data. For 10GB it should not take more than 5 minutes. Is that reasonably quick for you?

If you find that you are not IO bound (for example if you use some NAS) but CPU bound, you can sacrifice some readability and change it to this:

my $mod;
while (<>) {
  next unless m/$id_regexp/;
  $mod = $modifications{$1};
  next unless $mod;
  substr $_, $+[1] + $mod->[0], $mod->[1], $mod->[2];
}
continue { print }
Hynek -Pichi- Vychodil
Although my task is finished, I will try your solution as well, as Oracle is not always available. Anyway, thanks for your help.
Zsolt Botykai