Well, the same as in any other language/environment: understand the file format well enough to extract the strings.
And yes, for many file formats this means writing at least half a parser for the format. PDF is especially icky: its text often has no space characters per se, because a word gap is just a convention of how far apart the glyphs are placed. Furthermore, PDF can contain compressed streams, so simply searching the file for printable strings yields nothing of value.
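To see why a naive string search fails on compressed streams, here is a minimal sketch in Python. It assumes the stream uses zlib/deflate compression (what PDF calls `/FlateDecode`); the `content` bytes and the helper names `printable_strings` and `contains_word` are made up for illustration and are not taken from any real PDF or library.

```python
import re
import zlib


def printable_strings(data: bytes, min_len: int = 4) -> list[bytes]:
    """Return runs of printable ASCII, roughly what the Unix `strings` tool reports."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return re.findall(pattern, data)


def contains_word(data: bytes, word: bytes = b"searchable") -> bool:
    """Check whether any printable run in `data` contains `word`."""
    return any(word in run for run in printable_strings(data))


# Hypothetical page content: PDF text-showing operators, repeated so that
# the data compresses well. Not taken from a real file.
content = b"BT /F1 12 Tf (Hello, searchable world) Tj ET\n" * 20

# A stream with the /FlateDecode filter stores zlib data like this.
compressed = zlib.compress(content)

print(contains_word(content))                      # raw content
print(contains_word(compressed))                   # as stored in the file
print(contains_word(zlib.decompress(compressed)))  # after decoding the stream
```

The word is only found in the raw and the decompressed bytes, so a real extractor has to locate and decode each stream before it can look for text at all. Even then it only gets the text-showing operators, not finished words.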
Naturally, you can look for a library or another tool which already does this. I've seen a document repository which simply passed PDF files through pdf2ascii and fed the resulting text to Lucene.
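The "pass it through an existing tool" approach might look roughly like the sketch below. It is only an assumption about how such a pipeline could be wired up: it swaps in `pdftotext` (from Poppler/Xpdf) rather than the tool named above, assumes that program is installed and on the `PATH`, and leaves the indexing step out entirely. The function name `extract_text` and its `command` parameter are hypothetical.

```python
import subprocess


def extract_text(pdf_path: str,
                 command: tuple[str, ...] = ("pdftotext", "-enc", "UTF-8")) -> str:
    """Run an external converter on a PDF and return the text it prints.

    Assumes the converter accepts `<input file> -` and writes the text to
    stdout, which is how pdftotext behaves when the output file is "-".
    """
    result = subprocess.run(
        [*command, pdf_path, "-"],
        capture_output=True,  # collect stdout/stderr instead of printing them
        check=True,           # raise CalledProcessError if the tool fails
    )
    # Replace undecodable bytes rather than failing on odd output.
    return result.stdout.decode("utf-8", errors="replace")
```

With `pdftotext` installed, `extract_text("report.pdf")` (the file name is a placeholder) would return the document text as one string, ready to hand to whatever indexer you use. Keeping the converter in a parameter makes it easy to swap in a different tool with the same calling convention.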