I have a bunch of PDF files and my Perl program needs to do a full-text search of them to return which ones contain a specific string. To date I have been using this:

my @search_results = `grep -i -l \"$string\" *.pdf`;

where $string is the text to look for. However, this fails for most PDFs because the file format is obviously not plain ASCII.

What can I do that's easiest?

Clarification: There are about 300 PDFs whose names I do not know in advance. PDF::Core is probably overkill. I am trying to get pdftotext and grep to play nicely with each other given that I don't know the names of the PDFs, but I can't find the right syntax yet.

Final solution using Adam Bellaire's suggestion below:

@search_results = `for i in \$( ls ); do pdftotext \$i - | grep --label="\$i" -i -l "$search_string"; done`;
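
For what it's worth, the $( ls ) loop breaks on filenames containing spaces and also feeds non-PDF files to pdftotext. Here is a rough pure-Perl equivalent of the same idea, a sketch that only assumes pdftotext is on the PATH and the PDFs sit in the current directory:

use strict;
use warnings;

my $search_string = 'text to look for';
my @search_results;

# Use glob() rather than parsing `ls`, so only PDFs are touched and
# filenames with spaces survive.
for my $pdf (glob '*.pdf') {
    # pdftotext writes the extracted text to stdout when given '-'.
    open my $fh, '-|', 'pdftotext', $pdf, '-'
        or do { warn "pdftotext failed for $pdf: $!\n"; next };
    while (my $line = <$fh>) {
        if ($line =~ /\Q$search_string\E/i) {
            push @search_results, $pdf;
            last;    # one hit is enough, like grep -l
        }
    }
    close $fh;
}

print "$_\n" for @search_results;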
+7  A: 

The PerlMonks thread here talks about this problem.

It seems that for your situation, it might be simplest to get pdftotext (the command line tool), then you can do something like:

my @search_results = `pdftotext myfile.pdf - | grep -i -l \"$string\"`;
Adam Bellaire
+2  A: 

I second Adam Bellaire's solution. I used the pdftotext utility to create a full-text index of my ebook library. It's somewhat slow, but it does the job. As for full-text indexing, try PLucene or KinoSearch to store the index.
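
A minimal sketch of that kind of batch extraction (the directory names are made up for illustration; it assumes pdftotext is installed and writes one .txt file per PDF for the indexer, or plain grep, to work on later):

use strict;
use warnings;
use File::Basename qw(fileparse);

my $pdf_dir  = 'ebooks';        # hypothetical source directory
my $text_dir = 'ebooks-text';   # hypothetical output directory
mkdir $text_dir unless -d $text_dir;

for my $pdf (glob "$pdf_dir/*.pdf") {
    # Strip the directory and the .pdf extension to build the .txt name.
    my ($name) = fileparse($pdf, qr/\.pdf$/i);
    system('pdftotext', $pdf, "$text_dir/$name.txt") == 0
        or warn "pdftotext failed for $pdf\n";
}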

aku
+2  A: 

You may want to look at PDF::Core.

dsm
+1  A: 

The easiest full-text index/search I've used is MySQL. You just insert into a table with the appropriate full-text index on it. You need to spend some time working out the relative weightings for fields (a match in the title might score higher than a match in the body), but this is all possible, albeit with some hairy SQL.
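
A rough sketch of what that looks like from Perl with DBI (the docs table, its columns, and the 2:1 title weighting are invented for illustration; older MySQL versions require MyISAM tables for FULLTEXT indexes):

use strict;
use warnings;
use DBI;

# Hypothetical connection details -- adjust to your own database.
my $dbh = DBI->connect('dbi:mysql:database=pdfsearch', 'user', 'password',
                       { RaiseError => 1 });

# One row per document, with FULLTEXT indexes on the searchable columns.
$dbh->do(q{
    CREATE TABLE IF NOT EXISTS docs (
        id       INT AUTO_INCREMENT PRIMARY KEY,
        filename VARCHAR(255),
        title    VARCHAR(255),
        body     MEDIUMTEXT,
        FULLTEXT (title),
        FULLTEXT (body)
    ) ENGINE=MyISAM
});

# Weight a title hit twice as heavily as a body hit.
my $sth = $dbh->prepare(q{
    SELECT filename,
           2 * MATCH(title) AGAINST (?) + MATCH(body) AGAINST (?) AS score
    FROM docs
    WHERE MATCH(title) AGAINST (?) OR MATCH(body) AGAINST (?)
    ORDER BY score DESC
});

my $query = 'search terms';
$sth->execute(($query) x 4);

while (my ($filename, $score) = $sth->fetchrow_array) {
    printf "%-40s %.3f\n", $filename, $score;
}

(One gotcha: in natural-language mode, MyISAM full-text search ignores words that appear in more than half the rows, which can surprise you with a corpus of only ~300 documents.)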

Plucene is deprecated (there hasn't been any active work on it in the last two years afaik) in favour of KinoSearch. KinoSearch grew, in part, out of understanding the architectural limitations of Plucene.

If you have ~300 PDFs, then once you've extracted the text from them (assuming the PDFs contain text and not just images of text ;)), and depending on your query volume, you may find grep is sufficient.

However, I'd strongly suggest the MySQL/KinoSearch route, as those tools have already covered a lot of ground (stemming, stopwords, term weighting, token parsing) that you wouldn't benefit from getting bogged down in yourself.

KinoSearch is probably faster than the MySQL route, but MySQL gives you more widely used, standard software, tools, and developer experience. And you get the ability to use the power of SQL to augment your free-text search queries.

So unless you're talking HUGE data sets and insane query volumes, my money would be on MySQL.

+2  A: 

My library, CAM::PDF, has support for extracting text, but it's an inherently hard problem given the graphical orientation of PDF syntax. So, the output is sometimes gibberish. CAM::PDF bundles a getpdftext.pl program, or you can invoke the functionality like so:

use CAM::PDF;

my $doc = CAM::PDF->new($filename) || die "$CAM::PDF::errstr\n";
for my $pagenum (1 .. $doc->numPages()) {
    my $text = $doc->getPageText($pagenum);
    print $text;
}
Chris Dolan
A: 

You could try Lucene (the Perl port is called Plucene). The searches are incredibly fast, and PDFBox already knows how to index PDF files with Lucene. PDFBox is Java, but chances are there is something very similar somewhere on CPAN. Even if you can't find something that already adds PDF files to a Lucene index, it shouldn't be more than a few lines of code to do it yourself. Lucene will give you quite a few more searching options than simply looking for a string in a file.

There's also a very quick and dirty way. Text in a PDF file is sometimes stored as plain text, as long as the content streams aren't compressed. If you open such a PDF in a text editor or run 'strings' on it, you can see the text in there. The binary junk is usually embedded fonts, images, etc.

jm4