views: 47 · answers: 5

I have a large amount (several gigabytes worth) of archival data that I want to make available to users and search engines through a web interface. Most of the data will rarely change, so I'm debating the best way to store and deliver the data.

I would like to ensure that the data loads quickly and efficiently so it can easily be viewed by users and indexed by search engines without overloading my server.

Would it be more space and resource efficient to store the data in a MySQL database and dynamically generate the display pages, or pre-fill all of the display pages from the database and store them as static text/html (regenerating the pages every few weeks if necessary)?

+1  A: 

Your main concern is going to be searching and browsing the data. You will probably not want to build that functionality from scratch, but use one or several existing products. Therefore, I would drop the question "files or databases" and replace it with "what server / browsing / searching system am I going to use?".

There are several powerful Open Source solutions in the area. As just one example, Apache Solr looks like it could be useful to you:

Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites.

Sphinx is another popular Open Source system that is designed to search databases.
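For illustration, querying Solr is just an HTTP GET against its standard /select endpoint. This sketch only builds the request URL; the localhost base URL and the "archive" core name are assumptions, not anything from the question:

```python
from urllib.parse import urlencode

def solr_select_url(base="http://localhost:8983/solr/archive", q="*:*", rows=10):
    """Build a query URL for Solr's standard /select endpoint.

    The base URL and core name ("archive") are hypothetical; q, rows,
    and wt are standard Solr query parameters.
    """
    return f"{base}/select?{urlencode({'q': q, 'rows': rows, 'wt': 'json'})}"

url = solr_select_url(q="title:archive", rows=5)
```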

Pekka
+1  A: 

a compromise would be to store the data as static files on the server, and store the paths to the files in your database.

A simple "include" should put all that data on your web pages. This way you also avoid duplicating the data on all the pages that you want the data to be present on.
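A minimal sketch of that compromise, using SQLite and a temp directory as stand-ins for the real database and document root (the `chunks` table and slugs are hypothetical):

```python
import pathlib
import sqlite3
import tempfile

# Hypothetical schema: a table mapping page slugs to static chunk files.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE chunks (slug TEXT PRIMARY KEY, path TEXT)")

# Write one archival chunk to disk and register its path in the database.
chunk = pathlib.Path(tempfile.mkdtemp()) / "intro.html"
chunk.write_text("<p>archival data</p>")
db.execute("INSERT INTO chunks VALUES (?, ?)", ("intro", str(chunk)))

def include_chunk(slug):
    """Look up the file path in the DB and 'include' the file's contents."""
    (path,) = db.execute("SELECT path FROM chunks WHERE slug = ?", (slug,)).fetchone()
    return pathlib.Path(path).read_text()

html = include_chunk("intro")
```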

Here Be Wolves
This may be possible; the data will be "chunked" and included in sections on different pages, so pre-chunking it into text files should make including it on the pages it needs to be on fairly easy. My only fear is that managing several hundred text files will be a bit harder than managing one large relational database.
MarathonStudios
If you don't want to ever search for content in all those gigs of data, you're better off storing them in files. Storing the filenames/filepaths in the database is pretty much optional; it depends upon what you really need to build. BTW, once you've built the several hundred text files, you won't have to bother about them ever again.
Here Be Wolves
A: 

I hope you're not putting all that data on one page. If you do, you're going to grind people's web browsers to a halt. If the data is large and doesn't change much, I'd stick with static pages, possibly with programs to regenerate them when the data changes. That's the approach taken by, for instance, the Movable Type blog engine. If the program you use to generate the pages is written correctly, it can be quickly and easily changed to one that generates the pages dynamically on demand.
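A rough sketch of that regenerate-on-change approach; the `records` dict, template, and file names here are hypothetical stand-ins for the real data source:

```python
import pathlib
import tempfile

# Hypothetical records and template standing in for the real archive data.
records = {
    "page1": "First chunk of archival data.",
    "page2": "Second chunk of archival data.",
}
TEMPLATE = "<html><body><h1>{title}</h1><p>{body}</p></body></html>"

def regenerate(out_dir, data):
    """Rebuild every static page; rerun whenever the data changes."""
    out = pathlib.Path(out_dir)
    for slug, body in data.items():
        (out / (slug + ".html")).write_text(TEMPLATE.format(title=slug, body=body))

out_dir = tempfile.mkdtemp()
regenerate(out_dir, records)
```

Because all the logic lives in `regenerate`, switching to on-demand generation later just means calling the same render code from a request handler instead of a batch loop.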

Paul Tomblin
A: 

I would think it would depend on the number of "display pages" you will have. If there are only a relatively small number of interesting pages to display, then yes, go ahead and pre-generate them. However, I'm going to assume that there will be a large number of pages to display (possibly far too many to actually pre-compute).

I would think you would start off by de-normalizing some of your tables into the views you are interested in. This way you can avoid having to join all over the place. After that, if performance is still a concern, some sort of caching mechanism might be good for the more frequently used pages (a web cache, etc.). Of course, your database will also do some caching of its own.
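As a sketch of the caching idea, here is an in-process memoization of a page-render function; `render_page` is a hypothetical stand-in for the expensive denormalized-view query:

```python
from functools import lru_cache

calls = 0  # counts how many times the "expensive" render actually runs

@lru_cache(maxsize=128)
def render_page(slug):
    """Hypothetical stand-in for an expensive denormalized-view query."""
    global calls
    calls += 1
    return "<html>" + slug + "</html>"

first = render_page("home")
second = render_page("home")  # served from the cache, no second render
```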

It's all a trade off and highly dependent on the data.

Jason Tholstrup
A: 

If your main goal is to be indexed by Google and other search engines, you don't need a database. Put all your static data in the pages and build a sitemap.xml at the web server's root in order to be indexed by web bots...
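A sketch of generating that sitemap.xml from a list of static page URLs (the example URL is a placeholder):

```python
from xml.etree import ElementTree as ET

def build_sitemap(urls):
    """Build a minimal sitemap.xml document for the given page URLs."""
    ns = "http://www.sitemaps.org/schemas/sitemap/0.9"
    urlset = ET.Element("urlset", xmlns=ns)
    for u in urls:
        # each <url> entry needs at least a <loc> child
        loc = ET.SubElement(ET.SubElement(urlset, "url"), "loc")
        loc.text = u
    return ET.tostring(urlset, encoding="unicode")

xml = build_sitemap(["http://example.com/archive/page1.html"])
```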

Burçin Yazıcı