As I often work with a slow internet connection, or none at all, I have a webserver that serves commonly used documentation, for example:
- Various programming languages (PHP, Python, Java, ...)
- Various libraries (for example pthreads)
- Various open books
- RFCs
- IETF drafts
- Wikipedia (text-only, the uncompressed English dumpfile weighs 20GB!)
- Clipart galleries
I use these even when I'm online - there's less searching involved, and I can grep the files when necessary. However, the collection takes up a lot of space, at the moment about 30GB, so I'd like to compress it.
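To make the access pattern concrete, here's a rough sketch of what I'm imagining: compressing every document individually (gzip here, but any format would do) so that single files stay cheap to decompress on demand. The paths are made up:

```python
#!/usr/bin/env python3
"""Rough sketch: compress each HTML doc individually so single files
stay cheap to decompress on demand (gzip here; paths are made up)."""
import gzip
import shutil
from pathlib import Path

DOC_ROOT = Path("/srv/docs")  # hypothetical location of the collection

for src in DOC_ROOT.rglob("*.html"):
    dst = src.with_name(src.name + ".gz")
    with src.open("rb") as f_in, gzip.open(dst, "wb", compresslevel=9) as f_out:
        shutil.copyfileobj(f_in, f_out)
    src.unlink()  # keep only the compressed copy
```

Per-file compression won't squeeze as hard as one solid archive, but it keeps every document independently readable, which seems essential for the grep-style usage above.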
Furthermore, I'm looking for a nice way to search through all this stuff. The last time I tried, desktop search engines couldn't really cope with thousands of files or very, very big ones - and I assume that any meaningful full-text index would itself be a sizable fraction of the original text. Therefore, I'd like to index only certain areas (for example, only the Wikipedia article title, or the title and first paragraph, or only the short function description).
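Something along these lines is what I have in mind for the index, sketched with SQLite's FTS5 (assuming the bundled SQLite was built with it); the table layout, paths, and the crude regex extraction are my own invention:

```python
#!/usr/bin/env python3
"""Sketch: index only the <title> and first <p> of each compressed HTML
doc in an SQLite FTS5 table; layout and paths are my own assumptions."""
import gzip
import re
import sqlite3
from pathlib import Path

DOC_ROOT = Path("/srv/docs")  # hypothetical

def strip_tags(s):
    return re.sub(r"<[^>]+>", " ", s).strip()

db = sqlite3.connect(str(DOC_ROOT / "index.db"))
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS docs "
           "USING fts5(path UNINDEXED, title, firstpara)")

for gz in DOC_ROOT.rglob("*.html.gz"):
    with gzip.open(gz, "rt", encoding="utf-8", errors="replace") as f:
        html = f.read()
    title = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
    para = re.search(r"<p[^>]*>(.*?)</p>", html, re.I | re.S)
    db.execute("INSERT INTO docs VALUES (?, ?, ?)",
               (str(gz),
                strip_tags(title.group(1)) if title else gz.name,
                strip_tags(para.group(1)) if para else ""))
db.commit()

# querying touches only the index; only the hits I open get gunzipped
for path, title in db.execute(
        "SELECT path, title FROM docs WHERE docs MATCH ?", ("pthread_create",)):
    print(title, "->", path)
```

Indexing just the title and lead paragraph should keep the index tiny compared to the 30GB of source text, at the cost of not finding matches buried deep in a page.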
Is there a solution that lets me search the collection, uncompress only the needed portion of the compressed files, and format¹ them?
¹ for example preserving links in HTML documentation, converting PDF to HTML
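For the HTML case specifically, I suspect the formatting part could be almost free: per-file gzip can be served as-is with a `Content-Encoding: gzip` header, so the browser does the decompression and links keep working untouched. A minimal sketch (handler name and port are arbitrary):

```python
#!/usr/bin/env python3
"""Sketch: serve per-file gzip as-is with Content-Encoding: gzip, so the
browser decompresses it and the HTML, links included, renders untouched."""
from http.server import HTTPServer, SimpleHTTPRequestHandler

class GzipDocHandler(SimpleHTTPRequestHandler):
    def end_headers(self):
        # tell the browser the body is gzip-compressed
        if self.path.endswith(".gz"):
            self.send_header("Content-Encoding", "gzip")
        super().end_headers()

    def guess_type(self, path):
        # report the type of the *inner* file, e.g. foo.html.gz -> text/html
        if path.endswith(".gz"):
            path = path[:-3]
        return super().guess_type(path)

# run from the root of the docs collection; port is arbitrary
HTTPServer(("127.0.0.1", 8000), GzipDocHandler).serve_forever()
```

That still leaves PDF-to-HTML conversion and the search frontend open, which is why I'm asking whether a ready-made solution covers all of this.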