There are basically two techniques used for full-text search over big corpuses: posting lists and suffix arrays.
A posting list is a list of (term, document_id) pairs, optionally with a position in the document. If you sort it or hash it by term, you have an efficiently-searchable full-text index.
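To make that concrete, here's a toy in-memory posting-list index in a few lines of Python (the documents, tokenization, and conjunctive-query logic are all deliberately simplistic):

```python
from collections import defaultdict

# Toy posting-list index: term -> list of (doc_id, position) pairs.
def build_index(docs):
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].append((doc_id, position))
    return index

# Conjunctive query: return the ids of documents containing every query term.
def search(index, query):
    result = None
    for term in query.lower().split():
        docs_with_term = {doc_id for doc_id, _ in index.get(term, [])}
        result = docs_with_term if result is None else result & docs_with_term
    return result or set()

docs = {1: "the quick brown fox", 2: "the lazy dog", 3: "a quick grey fox"}
index = build_index(docs)
print(search(index, "quick fox"))   # -> {1, 3}
```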
There are various techniques for making posting lists smaller, faster to access, faster to update, and more flexible, some at the cost of accuracy. Lucene is probably the best off-the-shelf posting-list-based text indexer available today, and (contrary to your earlier comment) it can index text extracted from PDF, Microsoft Word, and other file formats. The Lucene.net project linked by Thomas Maierhofer looks like a pretty reasonable port, although of course you'll always be a bit behind the leading edge of what's going on in the Java version.
For a corpus that's much bigger than memory, you pretty much have to store the posting list on disk. That militates against using a simple binary search tree to access it: if you have a hundred thousand documents of ten thousand words each, you have a billion postings, which means your binary search tree has a minimum depth of 30. The trouble with that is that the 30 nodes on the path from the root of the tree to a leaf will, in general, be located in different parts of your disk, so the disk has to seek 30 times to find the postings for one term! At a typical seek time of around ten milliseconds, that's roughly a third of a second of pure seeking per term, which is prohibitively slow.
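As a back-of-the-envelope check (the ten-millisecond seek time is just an assumed round figure; real disks vary):

```python
import math

postings = 100_000 * 10_000             # 10^9 postings
depth = math.ceil(math.log2(postings))  # 30 levels in a balanced binary tree
seek = 0.010                            # assume ~10 ms per random disk seek
print(depth, depth * seek)              # 30 nodes, ~0.3 s of seeking per term
```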
However, there is a generalization of the binary search tree called a B-tree which can work: each node fills a whole disk block and has many children, so the tree is only a few levels deep and a lookup costs only a handful of seeks. Lucene uses a simple data structure that's a lot like a B-tree, but which supports massive updates much more easily. I wrote a very simple version of this same data structure in my own dumbfts project, which implements a full-text search engine for my email in a few pages of Python. I use it every day; it's free software, and it works pretty well for what I use it for, but it's not exactly a world-class search system like Lucene.
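To give a flavor of the on-disk shape (this isn't Lucene's actual file format, and it's cruder than what dumbfts does; the block size and JSON-lines encoding below are arbitrary choices for illustration): keep the postings sorted by term in one file, hold a small in-memory table mapping the first term of each block to its byte offset, and a lookup then costs a single seek into the right block instead of thirty.

```python
import json

BLOCK = 1000   # terms per on-disk block; one in-memory entry per block

def write_index(postings_by_term, path):
    """Write (term, postings) records sorted by term and return a small
    in-memory table of {first term in each block: byte offset of that block}."""
    sparse = {}
    with open(path, "w") as f:
        for i, term in enumerate(sorted(postings_by_term)):
            if i % BLOCK == 0:
                sparse[term] = f.tell()
            f.write(json.dumps([term, postings_by_term[term]]) + "\n")
    return sparse

def lookup(term, path, sparse):
    """One seek into the block that could contain the term, then a short scan."""
    starts = [t for t in sorted(sparse) if t <= term]
    if not starts:
        return []
    with open(path) as f:
        f.seek(sparse[starts[-1]])
        for _ in range(BLOCK):
            line = f.readline()
            if not line:
                break
            t, postings = json.loads(line)
            if t == term:
                return postings
            if t > term:
                break
    return []
```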
As an example of making posting lists smaller at the cost of accuracy, the Managing Gigabytes book (and the mg4j project) has a data structure called a "signed minimal perfect hash table", which doesn't actually store the terms that were indexed — just hashes of them. So there's a small probability of a false positive — you have to retrieve the documents that supposedly contain the term in order to confirm that they really do.
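Here's just the "signed hash" trade-off sketched with an ordinary Python dict; a real signed minimal perfect hash table also builds a collision-free hash function over the indexed terms, which is the part I'm skipping. The point is that the index stores a short hash instead of the term itself, so two different terms can collide and any hit has to be verified against the actual documents.

```python
import hashlib

def signature(term, bits=24):
    """A short hash ("signature") of the term; the index stores this
    instead of the term itself."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % (1 << bits)

# signature -> postings; a hit only means the documents *probably*
# contain the term, so they must be re-checked before being returned.
index = {}

def add(term, doc_id):
    index.setdefault(signature(term), []).append(doc_id)

def candidates(term):
    return index.get(signature(term), [])

add("gigabytes", 7)
print(candidates("gigabytes"))  # [7] -- now confirm doc 7 really contains the term
```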
Suffix arrays are a much more compact (and slightly slower to search) alternative to suffix trees, which are radix trees (aka tries) built over every suffix of the text. They're implemented by GLIMPSE and a few other programs, but they've basically fallen out of use these days. They have some flexibility not present in the posting-list data structure: they allow regular-expression searches and searches with misspellings, for example, but they're not quite as fast.

There has been some recent work with the Burrows-Wheeler Transform, which is based on suffix arrays, providing a compression algorithm in which the compressed file is the full-text index! The best-documented version of this is called the FM-index, although I've heard there are older, perhaps unpublished, versions of the technique. Unlike the other techniques described above, though, I don't think this works directly when the documents are PDF files or something like that; you could still extract a text version of each page and index that, but you don't get the benefit of compressing the original document.
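To make the suffix-array part concrete, here's a toy version: build the array with a naive sort of all suffixes and answer plain substring queries with binary search. (Real implementations use linear-time construction, and the regexp/misspelling queries and the FM-index are well beyond this sketch.)

```python
def suffix_array(text):
    """Start positions of the suffixes of `text`, in sorted order.
    Naive O(n^2 log n) construction, fine for a demo."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find(text, sa, pattern):
    """All start positions of `pattern` in `text`, by binary search over `sa`."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                      # leftmost suffix whose m-char prefix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                      # leftmost suffix whose m-char prefix > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = "the quick brown fox jumps over the lazy dog"
sa = suffix_array(text)
print(find(text, sa, "the"))   # -> [0, 31]
```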
My acquaintance Tim wrote a really good introductory series of blog postings on search back in 2003, and they still hold up; they cover this stuff (except for the recent developments) in a lot more depth.
Ravi: Is this the kind of information you're looking for?
Edit: thanks for fixing my formatting, Martin!