I want to parse through an 8 GB file to find some information, and it is taking me more than 4 hours to finish. I have tried the Perl Parallel::ForkManager module, but it doesn't make much difference. What is a better way to implement this?

The following is the part of the code used to parse this jumbo file. I have a list of domains that I have to look for in an 8 GB zone file, to find out which company each one is hosted with.

    open my $fh, '<', $file or do {
        print $LOG "Can't open '$file': $!";
        die "Can't open '$file': $!";
    };

    ### Reading Zone file : $file
    DOMAIN: while (my $line = <$fh>) {

        # the domain and the DNS host it currently points to
        my ($domain, undef, $new_host) = split /\s+/, $line;
        next DOMAIN if $seen{$domain};
        $seen{$domain} = 1;

        $domain .= ".$domain_type";
        $domain = lc $domain;

        # already tracked as a moved domain?
        if ($moved_domains->{$domain}) {

            # if it is still on the same host, there is nothing to record
            if ($new_host eq $moved_domains->{$domain}->{PointingHost}) {
                next DOMAIN;
            }
            # it has moved to a different host
            else {
                @INSERTS = ($domain, $data_date, $new_host, $moved_domains->{$domain}->{Host});
                log_this($data_date, $populate, @INSERTS);
            }
            delete $moved_domains->{$domain};
        }
        # not seen as a moved domain before
        else {
            # is this one of the hosts we are interested in?
            my ($interested) = grep { $new_host =~ /\b$_\b/i } keys %HOST;

            # if it is not one of our DNS hosts of interest, skip it
            next DOMAIN if not $interested;

            @INSERTS = ($domain, $data_date, $new_host, $HOST{$interested});
            log_this($data_date, $populate, @INSERTS);
        }
    }
+2  A: 

With the little information you've given, Parallel::ForkManager sounds like an appropriate tool, but you're likely to get better help if you give more detail about your problem.
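
For what it's worth, the usual pattern for parallelizing a single big file with Parallel::ForkManager is to split it into byte ranges and let each forked child scan its own range. A minimal sketch (the path and worker count are placeholders); note that forked children share no variables, so state like your %seen hash would be per-child, and each child must write its results somewhere to be merged afterwards:

    use strict;
    use warnings;
    use Parallel::ForkManager;

    my $file    = 'zone.txt';              # placeholder path
    my $workers = 4;                       # tune to your core count
    my $size    = -s $file;
    my $chunk   = int($size / $workers) + 1;

    my $pm = Parallel::ForkManager->new($workers);

    for my $i (0 .. $workers - 1) {
        $pm->start and next;               # parent keeps looping; child runs below

        open my $fh, '<', $file or die "Can't open '$file': $!";
        my $start = $i * $chunk;
        seek $fh, $start, 0;
        <$fh> if $start > 0;               # discard the partial line at the seek point

        while (my $line = <$fh>) {
            # a line belongs to the child whose range contains its first byte
            last if tell($fh) - length($line) >= $start + $chunk;
            # ... per-line work goes here; write results to a per-child
            # file and merge them after wait_all_children ...
        }
        close $fh;
        $pm->finish;
    }
    $pm->wait_all_children;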

Parallelizing is always a difficult problem. How much you can hope to gain depends a lot on the nature of the task. For example, are you looking for a specific line in the file? Or a specific fixed-size record? Or all the chunks that match a particular bit pattern? Do you process the file from beginning to end, or can you skip some parts, or do you do a lot of shuffling back and forth? etc.

Also, is the 8 GB file an absolute constraint, or might you be able to reorganize the data to make the information easier to find?

With the speeds you're quoting, if you're just going through the file once, I/O is not the bottleneck, but it's close. It could become the bottleneck if other processes are accessing the disk at the same time. It may be worth fine-tuning your disk access patterns; this is somewhat OS- and filesystem-dependent.
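
For example, reading in large blocks with sysread and splitting lines yourself cuts per-line read overhead and gives you control over the read size. A minimal sketch, assuming newline-terminated records and a placeholder path:

    use strict;
    use warnings;

    open my $fh, '<:raw', 'zone.txt' or die "Can't open: $!";
    my $tail = '';
    while (sysread $fh, my $block, 8 * 1024 * 1024) {   # 8 MB reads
        my @lines = split /\n/, $tail . $block, -1;
        $tail = pop @lines;                # keep the trailing partial line
        for my $line (@lines) {
            # ... per-line work goes here ...
        }
    }
    # if the file lacks a final newline, $tail now holds the last record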

Gilles
+5  A: 

A basic line-by-line parsing pass through a 1 GB file -- running a regex or something, for example -- takes just a couple of minutes on my 5-year-old Windows box. Even if the parsing work is more extensive, 4 hours sounds like an awfully long time for 8 GB of data.
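
A quick sanity check is to time a do-nothing pass over the file first; if even that takes hours, the machine or the disk is the problem rather than the parsing. A throwaway sketch (the path is a placeholder):

    use strict;
    use warnings;

    my $t0 = time;
    open my $fh, '<', 'zone.txt' or die "Can't open: $!";
    my $lines = 0;
    $lines++ while <$fh>;
    close $fh;
    printf "read %d lines in %d seconds\n", $lines, time - $t0;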

Are you sure that your code does not have a glaring inefficiency? Are you storing a lot of information during the parsing and bumping up against your RAM limits? CPAN has tools that will allow you to profile your code, notably Devel::NYTProf.
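
Profiling with Devel::NYTProf takes just two commands (the script name here is a placeholder):

    perl -d:NYTProf parse_zone.pl
    nytprofhtml    # writes an HTML report under ./nytprof/

One classic culprit in this kind of scan is running a separate /\b...\b/i match for every candidate host on every line. Building one precompiled alternation up front is usually much cheaper; a hedged sketch, assuming a %HOST hash whose keys are literal host names:

    # once, before the loop
    my %host_by_lc = map { lc $_ => $_ } keys %HOST;
    my $alt        = join '|', map { quotemeta } keys %HOST;
    my $host_re    = qr/\b($alt)\b/i;

    # inside the loop, one match replaces a grep over all keys
    if ($new_host =~ $host_re) {
        my $interested = $host_by_lc{ lc $1 };   # recover the original %HOST key
        # ...
    }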

Before going through the hassle of parallelizing your code, make sure that you understand where the bottleneck is. If you explain what you are doing or, even better, provide code that illustrates the problem in a compact way, you might get better answers.

FM