views:

258

answers:

8

Here's a tricky one - how do I programatically create and interrogate a database whose contents I can't really foresee?

I am implementing a generic input form system. The user can create PHP forms with a WYSIWYG layout and use them for any purpose he wishes. He can also query the input.

So, we have three stages:

  1. a form is designed and generated. This is a one-off procedure, although the form can be edited later. This designs the database.
  2. someone or several people make use of the form - say for daily sales reports, stock keeping, payroll, etc. Their input to the forms is written to the database.
  3. others, maybe management, can query the database and generate reports.

Since these forms are generic, I can't predict the database structure - other than to say that it will reflect HTML form fields and consist of a the data input from collection of edit boxes, memos, radio buttons and the like.

Questions and remarks:

A) how can I best structure the database, in terms of tables and columns? What about primary keys? My first thought was to use the control name to identify each column, then I realized that the user can edit the form and rename, so that maybe "name" becomes "employee" or "wages" becomes ":salary". I am leaning towards a unique number for each.

B) how best to key the rows? I was thinking of a timestamp to allow me to query and a column for the row Id from A)

C) I have to handle column rename/insert/delete. Foe deletion, I am unsure whether to delete the data from the database. Even if the user is not inputting it from the form any more he may wish to query what was previously entered. Or there may be some legal requirements to retain the data. Any gotchas in column rename/insert/delete?

D) For the querying, I can have my PHP interrogate the database to get column names and generate a form with a list where each entry has a database column name, a checkbox to say if it should be used in the query and, based on column type, some selection criteria. That ought to be enough to build searches like "position = 'senior salesman' and salary > 50k".

E) I probably have to generate some fancy charts - graphs, histograms, pie charts, etc for query results of numerical data over time. I need to find some good FOSS PHP for this.

F) What else have I forgotten?

This all seems very tricky to me, but I am database n00b - maybe it is simple to you gurus?


Edit: please don't tell me not to do it. I don't have any choice :-(

Edit: in real life I don't expect column rename/insert/delete to be frequent. However it is possible that after running for a few months a change to the database might be required. I am sure this happens regularly. I fear that I have worded this question badly and that people think that changes will be made willy-nilly every 10 minutes or so.

Realistically, my users will define a database when they lay out the form. They might get it right first time and never change it - especially if they are converting from paper forms. Even if they do decide to change, this might only happen once or twice ever, after months or years - and that can happen in any database.

I don't think that I have a special case here, nor that we should be concentrating on change. Perhaps better to concentrate on linkage - what's a good primary key scheme? Say, perhaps, for one text input, one numerical and a memo?

+1  A: 

This is for A) & B), and is not something I have done but thought it was an interesting idea that Reddit put to use, see this link (look at Lesson 3):

http://highscalability.com/blog/2010/5/17/7-lessons-learned-while-building-reddit-to-270-million-page.html

Kerry
That may work great for low-importance social bookmarking type apps but for a payroll app it would be a complete disaster.
Aaronaught
Mawg
You also note that any mention about reports or analysis against the Reddit database are suspiciously omitted.
Thomas
+1  A: 

Not sure about the database but for charts instead of using PHP for the charts, I recommend looking into using javascript (http://www.reynoldsftw.com/2009/02/6-jquery-chart-plugins-reviewed/). Advantages to this are some of the processing is offloaded to the client side for chart displays and they can be interactive.

mOX
Thanks. I am generally wary of client-side stuff as it can be turned off, hacked,etc, but I will look into this.
Mawg
+6  A: 

In essence, you are asking how to build an application without specifications. Relational databases were not designed so that you can do this effectively. The common approach to this problem is an Entity-Attribute-Value design and for the type of system in which you want to use it, the odds of failure are nearly 100%.

It makes no sense for example, that the column called "Name" could become "Salary". How would a report where you want the total salary work if the salary values could have "Fred", "Bob", 100K, 1000, "a lot"? Databases were not designed to let anyone put anything anywhere. Successful database schemas require structure which means effort with respect to specifications on what needs to be stored and why.

Therefore, to answer your question, I would rethink the problem. The entire approach of trying to make an app that can store anything in the universe is not a recipe for success.

Thomas
"asking how to build an application without specifications". Well, the app has specs. And the db will too, when the input form is generated, that defines the database."It makes no sense for example, that the column called "Name" could become "Salary". That does seem unlikely; my examples are more realistic, but, of course I can't prevent the user for doing stupid things. That is why the column name cannot be the primary key. But I don't think it unreasonable for someone to decide to rename an input field on a form. +1 for taking the time to reply, Thanks
Mawg
please see edit at end of original question. Thanks.
Mawg
@mawg - You are lacking specifications about the most critical piece of any database driven application: the data itself. Where it will really unravel is reporting. In essence, you will need to ask for the schema before you can build the data that will drive a report. Storage is only a small part of the puzzle. What will kill the project is performance and maintenance. Here is a great article about one guy's experience with an EAV and note the great analogy he uses. http://www.simple-talk.com/opinion/opinion-pieces/bad-carma/
Thomas
@mawg - If changes to the schema are expected to be infrequent, why can't specifications be gathered ahead of time? It would be far better and cheaper to build a normalized database with standard columns. In fact, it would be cheaper to custom build three different databases than to build one EAV to rule them all to store all three system's information.
Thomas
Mawg
@mawg - Then it sounds like management wants the same type of application mentioned in the article link I provided in an earlier post. It would make a good case study read for them to understand why such a system is untenable and more importantly, unprofitable.
Thomas
If I were you, I run as fast as I can from this disaster project.
HLGEM
+1  A: 

This is almost always a bad thing. Check out this thread for more responses:

http://stackoverflow.com/questions/66385/dynamic-database-schema/

Matt Rogish
Good link +1, thanks, but please also see edit at end of original question. Thanks.
Mawg
+2  A: 

Like Thomas said, rational database is not good at your problem. However, you may want to take a look at NoSQL dbs like MongoDB.

Codism
+1 thanks, but it has to be ODBC compliant
Mawg
+4  A: 

"This all seems very tricky to me, but I am database n00b - maybe it is simple to you gurus?"

Nope, it really is tricky. Fundamentally what you're describing is not a database application, it is a database application builder. In fact, it sounds as if you want to code something like Google App Engine or a web version of MS Access. Writing such a tool will take a lot of time and expertise.

Google has implemented flexible schemas by using its BigTable platform. It allows you to flex the schema pretty much at will. The catch is, this flexibility makes it very hard to write queries like "position = 'senior salesman' and salary > 50k".

So I don't think the NoSQL approach is what you need. You want to build an application which generates and maintains RDBMS schemas. This means you need to design a metadata repository from which you can generate dynamic SQL to build and change the users' schemas and also generate the front end.

Things your metadata schema needs to store

For schema generation:

  • foreign key relationships (an EMPLOYEE works in a DEPARTMENT)
  • unique business keys (there can be only one DEPARTMENT called "Sales")
  • reference data (permitted values of EMPLOYEE.POSITION)
  • column data type, size, etc
  • whether column is optional (i.e NULL or NOT NULL)
  • complex business rules (employee bonuses cannot exceed 15% of their salary)
  • default value for columns

For front-end generation

  • display names or labels ("Wages", "Salary")
  • widget (drop down list, pop-up calendar)
  • hidden fields
  • derived fields
  • help text, tips
  • client-side validation (associated JavaScript, etc)

That last points to the potential complexity in your proposal: a regular form designer like Joe Soap is not going to be able to formulate the JS to (say) validate that an input value is between X and Y, so you're going to have to derive it using templated rules.

These are by no means exhaustive lists, it's just off the top of my head.

For primary keys I suggest you use a column of GUID datatype. Timestamps aren't guaranteed to be unique, although if you run your database on an OS which goes to six places (i.e. not Windows) it's unlikely you'll get clashes.

last word

'My first thought was to use the control name to identify each column, then I realized that the user can edit the form and rename, so that maybe "name" becomes "employee" or "wages" becomes ":salary". I am leaning towards a unique number for each.'

I have built database schema generators before. They are hard going. One thing which can be tough is debugging the dynamic SQL. So make it easier on yourself: use real names for tables and columns. Just because the app user now wants to see a form titled HEADCOUNT it doesn't mean you have to rename the EMPLOYEES table. Hence the need to separate the displayed label from the schema object name. Otherwise you'll find yourself trying to figure out why this generated SQL statement failed:

update table_11123
set col_55542 = 'HERRING'
where col_55569 = 'Bootle'
/

That way madness lies.

APC
+1 Thanks for some great input,e specially the last word
Mawg
+1  A: 

The other respondents are correct that you should be very cautious with this approach because it is more complex and less performant than the traditional relational model - but I've done this type of thing to accommodate departmental differences at work, and it worked fine for the amount of use it got.

Basically I set it up like this, first - a table to store some information about the Form the user wants to create (obviously, adjust as you need):

--************************************************************************
-- Create the User_forms table
--************************************************************************
create table User_forms
    (
    form_id            integer identity,
    name               varchar(200),
    status             varchar(1),
    author             varchar(50),
    last_modifiedby    varchar(50),
    create_date        datetime,
    modified_date      datetime
    )

Then a table to define the fields to be presented on the form including any limits and the order and page they are to be presented (my app presented the fields as a multi-page wizard type of flow).

-

-************************************************************************
-- Create the field configuration table to hold the entry field configuration
--************************************************************************
create table field_configuration
    (
    field_id                integer identity,
    form_id                 SMALLINT,
    status                  varchar(1),
    fieldgroup              varchar(20),
    fieldpage               integer,
    fieldseq                integer,
    fieldname               varchar(40),
    fieldwidth              integer,
    description             varchar(50),
    minlength               integer,
    maxlength               integer,
    maxval                  varchar(13),
    minval                  varchar(13),
    valid_varchars             varchar(20),
    empty_ok                varchar(1),
    all_caps                varchar(1),
    value_list              varchar(200),
    ddl_queryfile           varchar(100),
    allownewentry           varchar(1),
    query_params            varchar(50),
    value_default           varchar(20)
    );

Then my perl code would loop through the fields in order for page 1 and put them on the "wizard form" ... and the "next" button would present the page 2 fields in order etc.

I had javascript functions to enforce the limits specified for each field as well ...

Then a table to hold the values entered by the users:

--************************************************************************
-- Field to contain the values
--************************************************************************
create table form_field_values
    (
    session_Id        integer identity,
    form_id           integer,
    field_id          integer,
    value             varchar(MAX)
    );

That would be a good starting point for what you want to do, but keep an eye on performance as it can really slow down any reports if they add 1000 custom fields. :-)

Ron Savage
+1 Ron, this may well get awarded the answer. I really don't see the problem (maybe because I am a n00b) and suspect that I am just explaining it badly.
Mawg
It's not so much of a problem - just that most "run time configurable" type of systems don't fit into a lot of developers object/data model mentality. As with any type of system, there are trade-offs. :-)
Ron Savage
@Ron Savage - It is not that the system does not fit within developers' mentality it is that the schema does not fit within the primary designs of a relational model. The effect of that is that reporting or querying against the data, i.e., getting information from your data, is a bear and performs poorly. Now, if you never have to report against the data, that is entirely different.
Thomas
+1  A: 

Hi Mawg,

See this article: http://www.simple-talk.com/opinion/opinion-pieces/bad-carma/ for someone else's experience of your problem.

Mark Bannister