views:

157

answers:

3

First let me describe my situation so that you might be able to help me better. There are two parts.

1: I have a program that runs and analyzes a bunch of files. It generates a "report" that will later be fed into a website for DB storage and viewing. This report can contain pretty much any type of data, since users can query almost anything; I left it very open-ended.

2: The website parses this report and adds an entry for things that are common, but it also creates a new table for any new data it finds. It also stores a mapping from report_id to all of these dynamically created tables. For example, if someone wanted to calculate the standard deviation in a report, and that made sense for the report, then there would be an STD table.
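
To make the current setup concrete, here is a rough sketch of what the site ends up creating; the names are illustrative, not my actual schema:

create table report_tables (       -- mapping from report_id to dynamic tables
    report_id  int(11),
    table_name varchar(255)
);
create table std (                 -- created on the fly when a report needs it
    report_id int(11),
    value     double
);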

Right now this site is written in PHP and looks kind of messy. Is there a better way to do this in PHP? Also, I'm considering reworking it in Rails for organization's sake. Is there a better way to do this in Rails, perhaps with "method_missing"?

I am not very skilled at building websites, and am an amateur with databases, so please be kind.

Thanks Eric

A: 

Modifying the structure of your database at run-time is very dangerous. I would suggest storing unknown or newly encountered information in a "mixed"-type table, where you can also attach a "tag" to each piece of information so it can be recognized. If you encounter a new type of information (such as a standard deviation), you create a new tag (a bit like a file extension) and add the information to that table with the tag associated to it. If you create a new table instead, you change the schema, and therefore you change the rules of use of the database.
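
A minimal sketch of such a "mixed" table, assuming MySQL and made-up tag/column names:

create table report_values (
    report_id   int(11),
    tag         varchar(50),   -- e.g. 'stddev', 'mean'; works a bit like a file extension
    field_value double
);

-- a new kind of value only needs a new tag, not a schema change:
insert into report_values (report_id, tag, field_value)
values (123, 'stddev', 4.2);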

JP
A: 

Automatically creating tables could give you headaches if the number of tables becomes vast. Directory listings in ext3 for dirs with over 5000 items start to feel expensive, and certainly are rather time consuming when you get over 100,000 files. (The mysql CLI will attempt to cache all table names when it connects, and to skip that, you need to connect with the -A switch.)

You might consider using temporary tables for the creation of the report, and then possibly condensing the results down into a report string for retrieval later. Or, as JP mentions, a table structure that tags values and reports in the same table:

create table stddev (
    report_id   int(11),      -- fk to reports
    field_name  int(11),      -- fk to a field-names lookup table
    field_value double
);
create table reports (
    report_id   int(11),
    report_name varchar(255)
);

And to get data for just one report, you would do a select specifying the report_id:

select * from stddev where report_id = 123;
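
The temporary-table approach mentioned above could look roughly like this (a sketch only; the names are made up):

create temporary table report_work (
    field_name  varchar(255),
    field_value double
);
-- compute into report_work while building the report, then condense the
-- results into a single string/row in a permanent table for later retrieval.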

I would create tables for your report names and field names, and you probably want to separate input values from the derived/computed values saved for your reports.
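
A sketch of that separation, with hypothetical names:

create table field_names (
    field_id   int(11),
    field_name varchar(255)
);
create table input_values (
    report_id   int(11),
    field_id    int(11),      -- fk to field_names
    field_value double
);
create table derived_values (
    report_id   int(11),
    field_id    int(11),      -- fk to field_names
    field_value double
);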

Depending on how often you input new data, I wouldn't prematurely optimize by splitting a few large tables into lots of small tables. Properly indexed, a large table can perform well.
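
For example, an index on report_id keeps the per-report select above fast even as the table grows (assuming MySQL):

create index idx_report on stddev (report_id);
-- the earlier query then only has to touch the rows for report 123:
select * from stddev where report_id = 123;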

However, what is the magnitude of data this application is processing? Were there reasons for using so many small tables to begin with?

I use PHP to process a lot of data. If your schema makes sense, it makes the PHP code make more sense, too. If it were me, I wouldn't switch to a different programming language; I would stick with what I started with until I came upon a real structural limitation of the language. For me, switching to Rails would be a waste of time.

memnoch_proxy
+2  A: 

It looks like you have unstructured, or at best semi-structured, data. Classic relational DB tables are not ideal for this unless you use XML for storage. You may consider coming up with an intermediate report-definition language and then storing that in the DB, usually as XML. MS SQL Server, Oracle, and DB2 all support storage and querying of XML data.
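
As a sketch of the XML-column idea, assuming SQL Server and a made-up report-definition format:

create table report_definitions (
    report_id  int primary key,
    definition xml                 -- the intermediate report-definition document
);
-- pull a single value back out of the stored definition:
select definition.value('(/report/stddev)[1]', 'float')
from report_definitions
where report_id = 123;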

After some thinking...
You could also look into the observation pattern and see if you could adapt it to this example. Although the pattern is OO, it can be done in SQL.
It is a bit of a long explanation, but I have published a simplified model here; I hope you find it useful.
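
A very rough reading of that pattern in SQL terms, with made-up names (not the published model itself): each observation ties an entity to a property and records the observed value, so a new kind of property means new rows rather than new tables.

create table entities (
    entity_id   int(11),
    entity_name varchar(255)     -- e.g. a report or an analyzed file
);
create table properties (
    property_id   int(11),
    property_name varchar(255)   -- e.g. 'standard deviation'
);
create table observations (
    entity_id      int(11),      -- fk to entities
    property_id    int(11),      -- fk to properties
    observed_value double
);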

Damir Sudarevic
You are right, this would definitely work. My problem really boils down to not defining all the types of data that I want to be able to process. Thanks a bunch.
esiegel