Sorry for the bad title.
I'm saving web pages. I currently use a single XML file as an index. Each entry element contains the file's created date (UTC) and the full URL (query string and all). The HTTP headers are stored in a separate file with a similar name but a special extension appended.
However, at around 40k files (headers included), the XML index is now 3.5 MB. Until recently I would read the file, add the new entry, and save the XML again for every download; now I keep it in memory and save it every once in a while.
When I request a page, the URL is looked up in the XML file using XPath; if there is an entry, the file path is returned.
The directory structure is .\www.host.com\randomFilename.randext
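To illustrate, the lookup is roughly this shape (just a sketch, not my exact code; the element and attribute names are placeholders, my real schema differs a bit):

    // Rough sketch of the index lookup. Assumed entry format (placeholder names):
    // <entry url="http://www.host.com/page?x=1" created="2011-05-01T12:00:00Z"
    //        path="www.host.com\randomFilename.randext" />
    using System.Xml;

    class PageIndex
    {
        private readonly XmlDocument _index = new XmlDocument();

        public PageIndex(string indexPath)
        {
            _index.Load(indexPath);   // the whole 3.5 MB index is loaded into memory
        }

        // Returns the saved file path for a URL, or null if it was never saved.
        // (URLs containing ' would need escaping before being put into the XPath.)
        public string FindPath(string url)
        {
            XmlNode node = _index.SelectSingleNode(
                string.Format("/index/entry[@url='{0}']", url));
            return node == null ? null : node.Attributes["path"].Value;
        }
    }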
So I am looking for a better way.
I'm thinking:
- One XML file per domain (including subdomains). But I feel this might be a hassle.
- Using SVN. I just tested it, but I have no experience with large repositories. I would execute svn add "path to file" for every download and commit when I'm done.
- Creating a custom file system, where I can then include everything I want, for example POST data.
- Generating a filename from the URL and somehow flattening the query string, but long query strings might be rejected by the OS. And if I keep the query string with the headers, I still need to keep track of multiple files mapped to each different query string. Hassle. And I don't want it to be too slow either (sketch below this list).
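Regarding the last item: the only way I can see to flatten an arbitrary query string into an OS-safe name is to hash the whole URL. A rough sketch of what I mean, assuming a SHA-256 hex string is acceptable as the filename (the real URL would still live in the index):

    // Sketch: turn any URL (query string included) into a fixed-length, filesystem-safe name.
    // The index still has to map the full URL back to this name.
    using System.Security.Cryptography;
    using System.Text;

    static class UrlFileName
    {
        public static string FromUrl(string url)
        {
            using (SHA256 sha = SHA256.Create())
            {
                byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(url));
                StringBuilder sb = new StringBuilder(hash.Length * 2);
                foreach (byte b in hash)
                    sb.Append(b.ToString("x2"));  // hex encode
                return sb.ToString();             // an extension could be appended if needed
            }
        }
    }

The headers file would keep its special extension on top of that name, like it does now.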
Multiple program instances, on different computers, will perform read/write operations.
If I follow the directory/file method, I could in theory add a layer in between that uses DotNetZip on the fly. But then again, there's the query string problem.
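By "a layer in between" I mean something along these lines: one zip archive per host, with the flattened/hashed name as the entry name. This is only a sketch against DotNetZip (Ionic.Zip), not something I have running:

    // Sketch of the zip layer using DotNetZip: one archive per host,
    // entry name = flattened/hashed filename.
    using System.IO;
    using Ionic.Zip;

    static class ZipStore
    {
        public static void SavePage(string host, string entryName, byte[] content)
        {
            string zipPath = host + ".zip";                   // e.g. www.host.com.zip
            using (ZipFile zip = File.Exists(zipPath)
                                     ? ZipFile.Read(zipPath)
                                     : new ZipFile(zipPath))
            {
                zip.UpdateEntry(entryName, content);          // add or replace the entry
                zip.Save(zipPath);
            }
        }
    }

The obvious catch is that with several instances writing from different computers, the archive would need locking, which is part of what puts me off this route.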
I'm just looking for direction or experience here.
What I also want is the ability to keep a history of these files, so the local file is not overwritten and I can pick which version (by date) I want. That's why I tried SVN.
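If SVN turns out to be a bad fit, the fallback I can think of is to never overwrite: bake the fetch date into the filename and let the index list every version per URL. Again just a sketch, names made up:

    // Sketch: keep every version side by side; the index records each one with its date.
    using System;

    static class VersionedName
    {
        public static string For(string baseName, DateTime fetchedUtc)
        {
            // e.g. 9f86d081...0a08.2011-05-01T120000Z.html
            return string.Format("{0}.{1:yyyy-MM-ddTHHmmss}Z.html", baseName, fetchedUtc);
        }
    }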