Is it possible to search multiple PDF files using the 'grep' command? It doesn't seem to work. How do people search the content of multiple PDF files?

+2  A: 

Use something like Solr or CLucene; I think they can do what you want.

xenoterracide
I was just looking for a simple search capability here
Samuel
Note: in the future, simple search is more of a Super User question, imo. I answered because these are solutions you might use to do it programmatically.
xenoterracide
+2  A: 

PDF is a binary format, which is why searching it with grep is not that helpful. You can search the strings in a PDF with grep like this:

strings dir_with_pdfs/*.pdf | grep "keyword"

Or you can run the pdftotext command on the PDFs and then search the result with grep.

Bozhidar Batsov
Sorry, that is just nonsense! PDF normally uses compressed objects and even if the objects were uncompressed, the text is only partly written in cleartext inside the pdf.
Patrick
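To illustrate the point about compressed objects, here is a small sketch that uses gzip as a stand-in for the Flate compression PDF streams normally use (an assumption for demonstration only; it does not read a real PDF). The sample text and variable names are made up. It counts matches for a keyword in the plain text, in the compressed bytes, and again after decompressing.

```shell
# Sample "page text" with a known keyword, repeated so it compresses well.
sample=$(mktemp)
i=0
while [ "$i" -lt 200 ]; do
    echo "the keyword appears on this page"
    i=$((i + 1))
done > "$sample"

# Searching the plain text finds every line.
plain_hits=$(grep -c keyword "$sample")

# Searching the compressed bytes as if they were text (-a).
# grep exits non-zero when nothing matches, hence the "|| true".
packed_hits=$(gzip -c < "$sample" | grep -a -c keyword || true)

# Decompressing first makes the text searchable again.
unpacked_hits=$(gzip -c < "$sample" | gzip -dc | grep -c keyword)

echo "plain=$plain_hits packed=$packed_hits unpacked=$unpacked_hits"
rm -f "$sample"
```

The compressed count should come out as zero even though every line contains the keyword, which is roughly what happens when grep or strings is pointed at a PDF with compressed streams.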
+1  A: 

Well, PDF is a binary format, and grep can search binary files as if they were text:

grep -a

or you can just use pdftotext (which comes with xpdf) like this:

pdftotext whee.pdf | grep pattern
wsh
grep -a doesn't seem to work.
Samuel
I am able to get this command working only if I pass a "-" after the name of the file to be searched, i.e. pdftotext whee.pdf - | grep pattern
Samuel
Oh, weird... the - means stdout (which is where the text needs to go for the pipe to work properly); in my shell you don't need to specify it, afaik.
wsh
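Since the question was about multiple files, here is a rough sketch of how the pdftotext approach from this answer might be looped over a directory. The function name search_pdfs and its arguments are made up for illustration. It assumes pdftotext (from xpdf or poppler-utils) is installed, and it passes the trailing "-" mentioned in the comments so the text goes to stdout.

```shell
# search_pdfs PATTERN [DIR]   (hypothetical helper, not part of any tool)
# Prints FILE:LINE_NUMBER:TEXT for each matching line of extracted text.
# The body runs in a subshell "( ... )" so its variables do not leak.
search_pdfs() (
    pattern=$1
    dir=${2:-.}
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue      # the glob matched nothing
        # "-" tells pdftotext to write the text to stdout
        pdftotext "$f" - 2>/dev/null |
            grep -n -e "$pattern" |
            while IFS= read -r line; do
                printf '%s:%s\n' "$f" "$line"
            done || true             # no match in a file is not an error
    done
)
```

Usage would look like: search_pdfs "keyword" dir_with_pdfs. The line numbers refer to the extracted text, not to anything in the PDF itself, and scanned PDFs without a text layer will produce no matches.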
+1  A: 

You don't mention which OS you're using, but under Mac OS X you can use mdfind from the command line:

mdfind -onlyin search/directory/path "kind:pdf search text"
Coxy
+1  A: 

PDF is a binary dump of the objects used to display the pages. There may be some metadata you can grep, but the actual page text is in a PostScript-like content stream and may be encoded in a variety of ways. It's also not guaranteed to be in any particular order. You need to think of PDF as more like a vector image file than a text file.

There is a short article explaining text in PDFs in more detail at http://pdf.jpedal.org/java-pdf-blog/bid/27187/Understanding-the-PDF-file-format-text-streams

mark stephens