views: 67

answers: 2
I have a very simple dictionary application that does search and display. It's built with the Win32::GUI module, and all of the plain-text data the dictionary needs sits under the __DATA__ section. The script itself is very small, but with everything under __DATA__ its size reaches 30 MB. To share the work with my friends, I packed the script into a stand-alone executable using the pp utility of the PAR::Packer module with the highest compression level (9), and now I have a single-file dictionary app of about 17 MB.

Although I'm very comfortable with the idea of a single-file script, placing such a huge amount of text data under the script's DATA section does not feel right. For one thing, when I try to open the script in Padre (Notepad++ is okay), I get an error like:

Can't open my script as the script is over the arbitrary file size limit which is currently 500000.


My questions:

Does it bring me any extra benefits, other than eliminating Padre's file-opening issue, if I move everything under the DATA section to a separate text file?

If I do so, what should I do to reduce the size of the separate file? Zip it and uncompress it while doing search and display?

How do people normally format the text data needed for a dictionary application?

Any comments, ideas or suggestions? Thanks like always :)

+2  A: 

Since you are using PAR::Packer already, why not move the data to a separate file or module and include it in the PAR archive?

The easy way (no extra command-line options to pp; it will see the use statement and do the right thing):

words.pl

#!/usr/bin/perl

use strict;
use warnings;

use Words;

for my $i (1 .. 2) {
    print "Run $i\n";
    while (defined(my $word = Words->next_word)) {
        print "\t$word\n";
    }
}

Words.pm

package Words;

use strict;
use warnings;

# Remember where __DATA__ starts so the word list can be re-read later
# (tell returns -1 on failure, so test for that rather than for falsehood).
my $start = tell DATA;
die "could not find current position: $!" if $start == -1;

# Return the next word, or undef at the end of the list after rewinding
# DATA so the next call starts over from the top.
sub next_word {
    if (eof DATA) {
        seek DATA, $start, 0
            or die "could not seek: $!";
        return undef;
    }
    chomp(my $word = scalar <DATA>);
    return $word;
}

1;

__DATA__
a
b
c
Chas. Owens
@Chas, thank you for sharing this great tip :) I've just tested THE easy way you suggested and pp does the thing just right! That's cool!
Mike
@Mike I am still playing with the hard, right way. Basically it comes down to adding `-a words.txt` to the `pp` line. If you want to read the whole file in at once, you can say `my $words = PAR::read_file('words.txt');`. I am still working on a method of reading the lines one by one. I believe it will involve `PAR::par_handle` and [`Archive::Zip::MemberRead`](http://search.cpan.org/dist/Archive-Zip/lib/Archive/Zip/MemberRead.pm).
Chas. Owens
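A rough sketch of that "hard way", assuming the word list is bundled as words.txt (the pp invocation, output name, and slurp-then-split logic are illustrative, not the author's exact method):

# Pack the data file alongside the script, e.g.:
#   pp -a words.txt -o dict.exe words.pl

use strict;
use warnings;
use PAR;    # PAR::read_file is usable when running from the packed executable

# Slurp the bundled file out of the PAR archive in one go.
my $words = PAR::read_file('words.txt');
defined $words or die "could not read words.txt from the PAR archive";

for my $word (split /\n/, $words) {
    print "$word\n";
}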
+2  A: 

If I do so, What should I do to reduce the size of the separate file? Zip it and uncompress it while doing search and display?

Well, it depends on WHY you want to reduce the size. If it is to minimize disk space usage (a rather unusual goal these days), then zip/unzip is the way to go.
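For instance, a compressed word list can be read line by line straight from a .gz file with the core IO::Uncompress::Gunzip module; a minimal sketch, assuming the data file has been gzipped as words.txt.gz:

use strict;
use warnings;
use IO::Uncompress::Gunzip qw($GunzipError);

# Open the gzipped word list and read it line by line,
# without ever unpacking it to disk.
my $z = IO::Uncompress::Gunzip->new('words.txt.gz')
    or die "gunzip failed: $GunzipError";

while (my $line = <$z>) {
    chomp $line;
    print "$line\n";
}
close $z;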

However, if the goal is to minimize memory usage, a better approach is to split the dictionary data into smaller chunks (for example, indexed by first letter) and load only the chunks that are needed.
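A minimal sketch of that chunked approach, assuming the dictionary has been pre-split into per-letter files such as dict/a.txt, dict/b.txt (the directory layout, tab-separated format, and cache are illustrative):

use strict;
use warnings;

my %chunk_cache;    # first letter => { word => definition }

# Load (and cache) only the per-letter file the word belongs to.
sub lookup {
    my $word   = lc shift;
    my $letter = substr $word, 0, 1;

    $chunk_cache{$letter} ||= do {
        my %chunk;
        open my $fh, '<', "dict/$letter.txt"
            or die "cannot open dict/$letter.txt: $!";
        while (my $line = <$fh>) {
            chomp $line;
            my ($w, $def) = split /\t/, $line, 2;
            $chunk{$w} = $def;
        }
        close $fh;
        \%chunk;
    };

    return $chunk_cache{$letter}{$word};
}

print lookup('apple') // 'not found', "\n";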

How do people normally format the text data needed for a dictionary application?

IMHO the usual approach is the logical conclusion of the partitioned-and-indexed idea mentioned above: a back-end database, which allows you to retrieve only the data that is actually needed.

In your case, probably something simple like SQLite or Berkeley DB/DBM files would be OK.
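For example, a lookup against an SQLite file via DBI/DBD::SQLite could look roughly like this (the database file name and the words table schema are assumptions):

use strict;
use warnings;
use DBI;

# Open (or create) the SQLite database file holding the dictionary.
my $dbh = DBI->connect('dbi:SQLite:dbname=dict.sqlite', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

# Assumed schema: CREATE TABLE words (word TEXT PRIMARY KEY, definition TEXT);
my $sth = $dbh->prepare('SELECT definition FROM words WHERE word = ?');

sub lookup {
    my ($word) = @_;
    $sth->execute($word);
    my ($definition) = $sth->fetchrow_array;
    $sth->finish;
    return $definition;
}

print lookup('apple') // 'not found', "\n";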

Does it bring me any extra benefits except for the eliminating of Padre's file opening issue if I move everything under the DATA section to a separate text file?

This depends somewhat on your usage... if it's a never-changing script used by 3 people, there may be no tangible benefits.

In general, it will make maintenance much easier (you can change the dictionary and the code logic independently - think of a virus-definitions file vs. the antivirus executable as a real-world example).

It will also decrease the process memory consumption if you go with the approaches I mentioned above.

DVK
These days I'd probably reach for YAML first to store data textually, since its format is human-readable and editable, and the interface is very easy to use and understand (plus anyone running a reasonably recent version of Perl should already have it installed).
Ether
@Ether - does YAML offer scalable, well-performing random lookups? Or is it just a formatting language a la XML with XSLT-like lookups? (At 30 MB, an XML+XSLT-type approach becomes significantly worse than a proper database as far as performance goes.)
DVK
@DVK: [YAML is only a serialization framework.](http://search.cpan.org/dist/YAML/lib/YAML.pm) If you pack a Perl hash into it, then yes, it provides proper random look-up, because it is a hash.
Dummy00001
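For illustration, packing a hash into YAML and loading it back for lookups might look like this (the file name and sample entries are assumptions):

use strict;
use warnings;
use YAML qw(DumpFile LoadFile);

# Write the dictionary hash out once...
DumpFile('dict.yml', { apple => 'a fruit', banana => 'another fruit' });

# ...and load the whole hash back later; lookups are then plain hash access.
my $dict = LoadFile('dict.yml');
print $dict->{apple}, "\n";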
@DVK, thank you for this detailed answer :) It makes very good sense to me!
Mike
@Ether and @Dummy00001, thanks for the comments. YAML sounds like an option. I'll take a look at it.
Mike