Sorry for the bad title.

I'm saving web pages. I currently use a single XML file as an index. Each element contains the file's creation date (UTC) and the full URL (with query string and so on). The HTTP headers go in a separate file with a similar name plus a special extension appended.
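
For illustration, an entry looks roughly like this (sketch; names simplified, not my actual schema):

    <entry created="2010-05-12T14:03:22Z"
           url="http://www.host.com/page.aspx?foo=bar&amp;baz=1"
           file=".\www.host.com\aF3kQ9.randext" />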

However, at around 40k files (headers included), the XML is now 3.5 MB. Until recently I would read the file, add the new entry, and save the XML for every download. Now I keep it in memory and save it every once in a while.

When I request a page, the URL is looked up with an XPath query against the XML file; if there is an entry, the file path is returned.
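
The lookup is essentially this (C# sketch matching the simplified entry above; escaping of quotes in the URL is omitted):

    using System.Xml;

    // Sketch: return the saved file path for a URL, or null if not indexed.
    // A URL containing an apostrophe would break this XPath and needs escaping.
    string Lookup(XmlDocument index, string url)
    {
        XmlNode node = index.SelectSingleNode(
            "/index/entry[@url='" + url + "']");
        return node == null ? null : node.Attributes["file"].Value;
    }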

The directory structure is .\www.host.com\randomFilename.randext

So I am looking for a better way.

I'm thinking:

  • One XML file per domain (incl. subdomains). But I feel this might be a hassle.
  • Using SVN. I just tested it, but I have no experience with large repositories. I'd execute svn add "path to file" for every download and commit when I'm done.
  • Creating a custom file system, where I can then include everything I want, e.g. POST data.
  • Generating a filename from the URL and somehow flattening the query string, but long query strings might be rejected by the OS. And if I keep the query string with the headers, I still need to keep track of multiple files mapped to each different query string. Hassle. And I don't want it to execute too slowly either. (See the hashing sketch after this list.)
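
For that last option, hashing the full URL (query string included) gives a fixed-length, filesystem-safe name; something like this (sketch, MD5 picked arbitrarily):

    using System.Security.Cryptography;
    using System.Text;

    // Sketch: derive a fixed-length filename from any URL, however long
    // the query string is. The real URL still lives in the index, so
    // collisions (practically negligible with MD5) could be detected there.
    static string UrlToFilename(string url)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(url));
            StringBuilder sb = new StringBuilder(hash.Length * 2);
            foreach (byte b in hash)
                sb.Append(b.ToString("x2"));
            return sb.ToString();
        }
    }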

Multiple program instances, running on different computers, will perform read/write operations.

If I follow the directory/file method, I could in theory add a layer in between so it uses DotNetZip on the fly. But then again, the query string problem remains.
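
Roughly, one archive per host (sketch using DotNetZip's ZipFile API; the per-host layout is just an idea):

    using System.IO;
    using Ionic.Zip;

    // Sketch: keep each saved page as an entry inside a per-host zip archive.
    void SavePage(string host, string entryName, byte[] content)
    {
        string archive = host + ".zip";
        using (ZipFile zip = File.Exists(archive)
                   ? ZipFile.Read(archive)
                   : new ZipFile(archive))
        {
            zip.UpdateEntry(entryName, content); // adds or replaces the entry
            zip.Save(archive);
        }
    }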

I'm just looking for direction or experience here.

What I also want is the ability to keep a history of these files, so the local file is not overwritten and I can pick which version (by date) I want. That's why I tried SVN.
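
Without SVN, the simplest alternative I can see is to never overwrite: append a UTC timestamp to each filename so versions sort naturally by date (sketch):

    using System;

    // Sketch: one file per version; picking a version by date is then
    // just a sorted directory listing.
    string VersionedName(string baseName)
    {
        return baseName + "." +
            DateTime.UtcNow.ToString("yyyyMMdd'T'HHmmss'Z'");
    }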

A: 

I would recommend either a relational database or a version control system.

You might want to use SQL Server 2008's new FILESTREAM feature to store the files themselves in the database.
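
Writing through FILESTREAM looks roughly like this (untested sketch; a SavedPages table with an Id column and a FileData FILESTREAM column is assumed):

    using System.Data.SqlClient;
    using System.Data.SqlTypes;
    using System.IO;

    // Sketch: stream bytes into a FILESTREAM column inside a transaction.
    void StoreFile(SqlConnection conn, int id, byte[] content)
    {
        using (SqlTransaction tx = conn.BeginTransaction())
        {
            var cmd = new SqlCommand(
                "SELECT FileData.PathName(), " +
                "GET_FILESTREAM_TRANSACTION_CONTEXT() " +
                "FROM SavedPages WHERE Id = @id", conn, tx);
            cmd.Parameters.AddWithValue("@id", id);

            string path;
            byte[] txContext;
            using (SqlDataReader r = cmd.ExecuteReader())
            {
                r.Read();
                path = r.GetString(0);
                txContext = (byte[])r[1];
            }

            using (var fs = new SqlFileStream(path, txContext, FileAccess.Write))
                fs.Write(content, 0, content.Length);

            tx.Commit();
        }
    }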

SLaks
MySQL also has the BLOB type, which can likewise be used to store binary data inside the database.
Fiarr
A: 

I would use two data stores: one for the raw files and another for the indexes.

To store the flat files, I think Berkeley DB is a good choice; the key can be generated by MD5 or another hash function, and you can also compress the content of the files to save some disk space.
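
The compression part is a few lines with GZipStream (C# sketch):

    using System.IO;
    using System.IO.Compression;

    // Sketch: gzip a page body before storing it under its hash key.
    static byte[] Compress(byte[] data)
    {
        using (var ms = new MemoryStream())
        {
            using (var gz = new GZipStream(ms, CompressionMode.Compress))
                gz.Write(data, 0, data.Length);
            return ms.ToArray(); // safe: the gzip stream is closed above
        }
    }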

For the indexes, you can use a relational database or a more sophisticated full-text search engine like Lucene.
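
With Lucene (Lucene.NET here, to match the poster's stack), indexing an entry could look like this (sketch against the 2.9-era API; field names are made up):

    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;
    using Lucene.Net.Store;

    // Sketch: add one Lucene document per saved page, keyed by URL.
    void IndexPage(string url, string filePath)
    {
        var dir = FSDirectory.Open(new System.IO.DirectoryInfo("index"));
        var writer = new IndexWriter(dir,
            new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29),
            IndexWriter.MaxFieldLength.UNLIMITED);

        var doc = new Document();
        doc.Add(new Field("url", url, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.Add(new Field("path", filePath, Field.Store.YES, Field.Index.NO));
        writer.AddDocument(doc);
        writer.Close();
    }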

Tony