As I often work without a fast internet connection - or without any at all - I have a webserver that serves commonly used documentation, for example:

  • Various programming languages (php, Python, Java, ...)
  • Various libraries (for example pthreads)
  • Various open books
  • RFCs
  • IETF drafts
  • Wikipedia (text-only, the uncompressed English dumpfile weighs 20GB!)
  • Clipart galleries

I use these even when I'm online - less searching needed, and I can grep the files if necessary. However, this collection takes up a lot of space, currently about 30GB, so I'd like to compress it.

Furthermore, I'm looking for a nice way to search through all this stuff. The last time I tried, desktop search engines couldn't really cope with thousands of very, very big files - and I assume that any meaningful full-text index will itself be a sizable fraction of the original text. Therefore, I'd like to index only certain areas (for example, only the Wikipedia title, or only the title and first paragraph, or only the short function description).

Is there a solution that lets me search the collection, uncompress only the needed portion of the compressed file, and format¹ it?

¹ for example preserving links in HTML documentation, converting PDF to HTML

+3  A: 

Depending on what OS you're using, Microsoft Compiled HTML Help files might be a good option. PHP has its documentation available as a CHM. You can get a compiler here.

VirtuosiMedia
I'm mostly using unixoid systems. Plus, looking into the unofficial documentation, CHM seems really convoluted. And it's a closed format.
phihag
CHM files (for various systems - PHP, Smarty, whatever) are usually updated less often than the HTML versions. However, that's what I use - the tidiness of one file and the built-in index/search make this format a winner for me.
MaxVT
A: 

I store them in raw text form.

  • Not bound to any specific company (Sun, MS, Google)
  • Parseable by any computer language or OS
  • Inherently human readable
  • Low overhead

No serious person would disagree.

Well, the two main disadvantages are: it's hard to search (a search can take minutes on a 20GB en.wikipedia dump with a slow laptop HDD), and it takes a lot of space.
phihag
There are lots of things that will take minutes on a laptop with a slow HDD... and what makes you think a text file is larger than, say, PDF or .doc?
@theman_on_vista I don't know about your hardware configuration, but the fastest medium I have here that can store 20GB can't read faster than a couple of hundred MB/s, so a full scan still takes about a minute per search. And a text file is of course smaller than any formatted one, but way bigger than a compressed ...
phihag
... version. For example, text-only (well, XML, but close to text-only) en.wikipedia is about 5GB as .bz2, but inflates to 20GB when uncompressed.
phihag
I feel like we are arguing two completely unrelated things here lol
Point by point: * Neither is HTML or PDF. * So what? Are you storing the docs for yourself or for your OS? * Others are better. HTML and any other structured format whatsoever, with headings and images, is more readable. * So what? Hard drives are cheap. Your time is not. Guess I'm not serious :)
MaxVT
MaxVT - oh, so you can't read a book unless it has pictures?
+8  A: 

Hmm, I think it's an interesting problem, and I hope to be able to post a "real" answer for it later, after some research.

However, for your case, 30GB of data really is not that much. In terms of effort/cost-to-benefit ratio, the correct solution is probably "buy a new hard drive": 1TB drives are less than $100 now, while developing a quality solution for this will likely take a huge amount of effort. Unless you don't value your time much (or just plain enjoy working on the problem), it's much easier to just buy more space.

Chad Birch
A new drive is a pretty good solution, but the one requirement it doesn't solve is making the collection (quickly and easily) searchable.
digitaljoel
I agree the space is not the problem when I'm hosting the whole thing on a fat server. However, I dream of someday storing the entire Wikipedia and the mentioned docs on my cell phone or a thumb drive.
phihag
If you are able to get a drive as large as 1TB, the space required to index all the content for fast searches wouldn't matter.
Bratch
+3  A: 

(This answer assumes you're on Linux or similar.)

For the compression portion, you may want to try something like FuseCompress, which will allow you to mount a directory as a compressed filesystem. Compression happens on write, decompression happens on read. I've never used this, but I've used other fuse-based filesystems in the past with no problems. Some googling turned up some other fuse-based compressed mount options.

This retains readability/parseability/searchability as standard text through the fuse mount point, but could yield considerable space savings. Access will, however, be slower than just accessing the raw text.
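
FuseCompress itself is configured at mount time rather than from code, but the same "decompress only what is actually read" idea can be approximated without a FUSE layer by compressing each file individually. A minimal Python sketch - the paths and helper name are purely illustrative:

    import gzip
    import os

    def open_doc(path):
        """Open a documentation file for reading; if only a per-file
        gzip'd copy exists, decompress that single file transparently."""
        if os.path.exists(path):
            return open(path, 'rb')
        return gzip.open(path + '.gz', 'rb')

    # Only the requested file is ever decompressed, never the whole collection.
    with open_doc('/srv/docs/rfc/rfc2616.txt') as f:
        print(f.read(200))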

As for searchability, if you keep things accessible as raw text you have a ton of options. Beagle and Tracker come to mind.

Marty Lamb
I wouldn't let Tracker near that much data... I have had bad experiences with Tracker and lots(!) of text files...
Kalmi
afaik, FuseCompress can't jump to arbitrary bytes in the uncompressed stream without uncompressing the whole file, can it?
phihag
I doubt it - if you're just serving the files from your local webserver you don't need that, though. For the bold portion of your question (uncompress just the needed portion and format it) that's probably a blocker.
Marty Lamb
+2  A: 

If it's a Windows server, turn on NTFS compression. It works reasonably well for text. As for very, very big files - you mean the Wikipedia dump? Try dividing it up into smaller chunks and give the desktop search programs another try. Copernic Desktop Search worked very well for me. It has many settings that can help you fine-tune the indexing "depth" and performance.
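
If you go the chunking route, a rough Python sketch that cuts the dump into pieces of roughly fixed size - but only right after a </page> tag, so no article is torn apart - might look like this. The chunk size and naming scheme are arbitrary assumptions, and the pieces are not valid standalone XML, which desktop indexers generally don't mind:

    def split_dump(dump_path, chunk_bytes=64 * 1024 * 1024):
        """Split a huge MediaWiki XML dump into ~64MB pieces, cutting only
        right after a closing </page> tag so articles stay intact."""
        part, written, out = 0, 0, None
        with open(dump_path, encoding='utf-8') as src:
            for line in src:
                if out is None:
                    out = open('%s.%04d' % (dump_path, part), 'w', encoding='utf-8')
                out.write(line)
                written += len(line)
                if written >= chunk_bytes and line.strip() == '</page>':
                    out.close()
                    out, written, part = None, 0, part + 1
        if out is not None:
            out.close()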

TomA
This answer is essentially the same as Marty Lamb's: It works, but not that well
phihag
A: 

The Google search appliances and their supporting applications are perhaps somewhat pricey for your application, but they are great and they know how to deal with files in hundreds of different formats. Overall, your storage requirement of 30GB is really not that much space anymore, and avoiding compression will make it much easier to index, access, and maintain.

Tall Jeff
+2  A: 

First, as stated by others, 30GB is not much. Just buy another or a bigger hard drive. You could also compress it on Windows using the built-in file compression option: right-click the folder/file and click Properties, click Advanced, check "Compress contents to save disk space," and click Apply.

For searching, I'm not sure what your OS is or whether you can change it, but Microsoft Search Server Express (free) might be your best option if the documentation is stored in one of the many formats that Search Server can search. Since it also uses IFilters, you could create your own IFilter if needed. If you cannot change to a Windows Server OS, use Desktop Search instead. I prefer developing on a server OS, so that doesn't bother me; however, at some point the search service will need to index all that information.

thames
As to buying a bigger hard drive, that may be more difficult if the questioner is using a laptop, and especially so if they are using a solid state drive. SSDs above 100 GB are rather expensive.
Brian Campbell
+2  A: 

One interesting concept I've found is a way to use Wikipedia's bzip2 archives offline: http://users.softlab.ece.ntua.gr/~ttsiod/buildWikipediaOffline.html

This technique could possibly be adapted to any documentation you want.
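
The core trick in that article is to store the dump as many small, independent bzip2 streams and remember the byte offset of each one; answering a query then means decompressing only the stream that contains the wanted article. A hedged Python sketch of the read side, assuming you already have an index mapping titles to stream offsets (the file name, offset and buffer size below are made up):

    import bz2

    def read_stream(dump_path, offset, max_compressed=10 * 1024 * 1024):
        """Decompress a single bzip2 stream starting at `offset` in a
        multi-stream archive; assumes one stream fits in max_compressed bytes."""
        with open(dump_path, 'rb') as f:
            f.seek(offset)
            data = f.read(max_compressed)
        decompressor = bz2.BZ2Decompressor()
        # The decompressor stops at the end of the first stream; bytes
        # belonging to the next stream end up in .unused_data and are ignored.
        return decompressor.decompress(data)

    # xml_fragment = read_stream('enwiki-pages-articles-multistream.xml.bz2', 123456789)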

cobbal
+9  A: 

Did you think about using Apache Lucene for this purpose? It is quite easy to index just what you want, and you can also consider writing your own parser for specific file formats that are not handled natively by Lucene.

EDIT: You should really consider such a solution: Lucene offers quite an easy API for querying its indexes, it is well documented, and you can find a lot of resources about it on the web. Also, since your problem may interest a lot of people, it could be a good starting point for an open-source project to create a personal search engine. I guess that could be challenging and interesting for a lot of people.
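
To illustrate the "index only what you need" part, a small streaming parser can pull just the <title> elements out of the Wikipedia XML dump and hand them to Lucene (e.g. via PyLucene), Xapian or any other indexer. The element names follow the MediaWiki export schema; everything else here is an assumption:

    import xml.etree.ElementTree as ET

    def iter_titles(dump_path):
        """Stream over a MediaWiki XML dump, yielding only page titles."""
        for _, elem in ET.iterparse(dump_path, events=('end',)):
            # Tags are namespaced, e.g. '{http://www.mediawiki.org/xml/export-0.4/}title'
            if elem.tag.endswith('}title') or elem.tag == 'title':
                yield elem.text
            elem.clear()   # drop page text as we go so memory use stays modest

    # for title in iter_titles('enwiki-pages-articles.xml'):
    #     ...   # hand the title off to the indexer of your choice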


Why I chose this answer: I'm going to write an application that does exactly what is asked, probably using Lucene. I expect the two main problems to be selecting what to index (for example, only <title> in the Wikipedia dump) and indexing compressed files without unpacking them completely (or indexing while compressing). As soon as I release it I'll add a link here. Thanks for all your answers! --phihag

Matthieu BROUILLARD
Thanks phihag. Do not hesitate to contact me - the topic is interesting, and if I have time to help, I will be pleased to do so.
Matthieu BROUILLARD
+3  A: 

Cross-platform (well, Linux/Windows at least) solution:

  1. Keep the documents in their original format (be it text, html, pdf ...) so you don't lose formatting.
  2. Compress them using your favorite compression algorithm and put them in smaller (maybe thematic?) archives ...
  3. Use Xapian (Python, Perl, C++ ...) to build a full-text index over all your documents. If it's only 30GB, the probabilistic text search will be blazingly fast. Be sure to store a reference to the path of the individual files in your Xapian indexes too (see the sketch below).
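
A rough sketch of step 3 with the Python bindings: index only the part you care about (title, first paragraph, ...) and store the file path as document data, so a hit can be pulled out of the compressed archive afterwards. The database path, stemmer language and the commit call are assumptions about your setup and Xapian version:

    import xapian

    def index_docs(db_path, docs):
        """docs: iterable of (file_path, text) pairs, where text is only
        the portion worth indexing (title, first paragraph, ...)."""
        db = xapian.WritableDatabase(db_path, xapian.DB_CREATE_OR_OPEN)
        termgen = xapian.TermGenerator()
        termgen.set_stemmer(xapian.Stem('english'))
        for file_path, text in docs:
            doc = xapian.Document()
            termgen.set_document(doc)
            termgen.index_text(text)
            doc.set_data(file_path)   # remember where the full document lives
            db.add_document(doc)
        db.commit()

    def search(db_path, querystring, limit=10):
        db = xapian.Database(db_path)
        parser = xapian.QueryParser()
        parser.set_stemmer(xapian.Stem('english'))
        parser.set_database(db)
        parser.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
        enquire = xapian.Enquire(db)
        enquire.set_query(parser.parse_query(querystring))
        return [(match.percent, match.document.get_data())
                for match in enquire.get_mset(0, limit)]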

ChristopheD
+2  A: 

I had this same problem.

Don't go writing anything - it's been done :)

doxmentor4j allows portable library transport (take your repository with you and use it anywhere) and uses Lucene as its engine. CHM files, you name it - Lucene eats it.

Enjoy!

Bob Blanchett