If I know the correct format of fields, should I create check constraints for all of those fields, or will this affect the performance of inserts/updates too much? Would it be a good idea to use regular expressions for complex rules, or should I only use simple constraints like case and length?

The fields are already being validated at the application level.

+6  A: 

In general it is best not to trust the application and to put in the check constraints: the data must maintain its integrity (who knows what rogue script may run, or what program bug may slip through). However, if you have many check constraints, or several complex ones, and you notice an insert/update slowdown, you may want to reevaluate. Is it really necessary to have one on each field, though? No. The column data type and length already act as constraints.
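
For example, a minimal sketch (the table and column names here are hypothetical) of how far the declarations alone get you; only the status rule needs an explicit CHECK:

-- Hypothetical example: the data type, length, and NOT NULL already
-- act as implicit constraints; only the status rule needs a CHECK.
CREATE TABLE customer (
    customer_id NUMBER(10)    NOT NULL,            -- numeric, at most 10 digits
    email       VARCHAR2(254) NOT NULL,            -- length capped by the type
    status      CHAR(1)       NOT NULL
                CHECK (status IN ('A', 'I', 'P'))  -- simple, cheap rule
);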

KM
A: 

This depends on your paranoia level.

Of course double-checks are better than single-checks, but checking on the client side has the benefit of parallelization.

Client side checks are performed by many, possibly thousands of computers that your clients use, while server-side checks are performed by the single server.

I just ran a test on my Oracle 10g:

CREATE TABLE t_check (value VARCHAR2(50) NOT NULL, CHECK(REGEXP_LIKE(value, '^[0-9]{1,10}$')))

INSERT
INTO    t_check
SELECT  level
FROM    dual
CONNECT BY
        level <= 1000000

With the CHECK, this runs for 27 seconds; without it, it takes but 2 seconds.

If this is a problem for you, and if you are absolutely sure that no value will ever get into your database unless it has been checked by the client software, then rely on the client side.
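
If you do go that way, one option (a sketch, assuming the check is added as a named constraint rather than inline) is to keep the rule in the schema but disable it, so it stays documented and can be re-enabled later:

-- Name the constraint so it can be toggled on and off.
ALTER TABLE t_check ADD CONSTRAINT chk_value_digits
    CHECK (REGEXP_LIKE(value, '^[0-9]{1,10}$'));

-- Skip the runtime check while keeping the rule in the schema:
ALTER TABLE t_check DISABLE CONSTRAINT chk_value_digits;

-- Re-enable it, re-validating all existing rows, if you change your mind:
ALTER TABLE t_check ENABLE VALIDATE CONSTRAINT chk_value_digits;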

Quassnoi
+1  A: 

"This depends on your paranoia level.

Of course double-checks are better than single-checks, but checking on the client side has the benefit of parallelization.

Client side checks are performed by many, possibly thousands of computers that your clients use, while server-side checks are performed by the single server."

While none of this is untrue per se, this answer stresses the importance of those 25 seconds disproportionately, and thus seems rather biased toward "rely on the clients". That is unwise, period. Especially if the cost for a total of a million inserts is as negligible as 25 seconds. You never know for certain whether ALL clients will correctly implement all the needed checks, and even if you DO know that for the clients that CURRENTLY exist, you still don't know anything about future clients.

What you must consider is the repair cost you will incur when your data gets "corrupted" as a consequence of some constraint that wasn't enforced by the database system. For example, ponder whether the "poor guy" who had to solve the following problem (http://stackoverflow.com/questions/1408023/find-guid-in-database) would be done in 25 seconds.

Even if the sum total of the time needed to do ALL the constraint checking makes your transactions noticeably slower, it is probably more advisable to try to convince your organization to invest in more or faster hardware.

Data that is guaranteed to satisfy integrity rules is the most important asset in most businesses today, and should be cherished as such.

Erwin Smout
"That is unwise, period. Especially if the cost for a total of a million inserts is as negligible as 25 seconds." That depends on how often you insert a million records, doesn't it. Quassnoi points out that under certain circumstances you might consider not to use a check constraint. Valid point IMHO.
Robert Merkwürdigeliebe
A: 

It really depends on your overall architecture. If your database is being used strictly as a data store, then there should be no checks in the database.

The fact that you're asking this question at all suggests that you're using the database in a more traditional way. In that case, it is usually best to add as many constraints in the database as possible, then profile to see which ones need to be removed to improve performance (at which point you need to decide whether the performance boost is worth the reduced security).
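
As a starting point for that profiling pass, the data dictionary lists every check constraint currently in force (a sketch using Oracle's standard USER_CONSTRAINTS view, since the rest of this thread is Oracle; filter as you see fit):

-- List the check constraints on your tables with their conditions,
-- so you can decide which ones to keep, disable, or rewrite.
-- Note: constraint_type 'C' also includes NOT NULL constraints.
SELECT table_name
,      constraint_name
,      search_condition
FROM   user_constraints
WHERE  constraint_type = 'C'
ORDER  BY table_name;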

Jeff Hornby
+4  A: 

Quassnoi's points regarding performance are valid, but it's worth remembering that not all CHECK constraints are equal. In the following tests I benchmarked the REGEXP_LIKE() check against two more "old-fashioned" checks: the first converts the value into a string of zeroes and then does an equality check, and the second does a range check using BETWEEN.

"Wall clock" tests are sensitive to ambient conditions (such as a checkpoint firing) so we need to run them several times. The other thing to bear in mind is that performance can vary from version to version. For instance, I'm running on 11g and the regex check ran consistently at 9-10 seconds, which suggests Oracle have optimized it quite considerably since 10g. On the other hand the unchecked inserts ran at 1.7 - 2-ish seconds, so regex is still relatively expensive. The other checks weighed in at about 2.5 - 3.0 seconds, which is roughly a 50% hit for integrity.

Whether it is worth paying that toll really depends on how you feel about your data. Experience says that relying on the client to enforce the data rules inevitably means that at some point the rules will be broken: either because a developer omits to apply them or deletes them from the application, or because somebody attaches a fresh client application (e.g. batch upload, web service) to the database which does not include those rules.

Finally, most applications are not going to be loading one million rows at a time. Compared to the network round trip, the microseconds required to apply checks to a single insert or upload are probably a trivial overhead.

SQL> CREATE TABLE t_check (value VARCHAR2(50))
  2  /

Table created.

Elapsed: 00:00:00.01
SQL>
SQL> INSERT
  2  INTO    t_check
  3  SELECT  level
  4  FROM    dual
  5  CONNECT BY
  6          level <= 1000000
  7  /

1000000 rows created.

Elapsed: 00:00:01.68
SQL>
SQL> prompt Regex check
Regex check
SQL>
SQL> drop table t_check
  2  /

Table dropped.

Elapsed: 00:00:00.37
SQL> CREATE TABLE t_check (value VARCHAR2(50) NOT NULL
  2        , CHECK(REGEXP_LIKE(value, '^[0-9]{1,10}$')))
  3  /

Table created.

Elapsed: 00:00:00.07
SQL>
SQL> INSERT
  2  INTO    t_check
  3  SELECT  level
  4  FROM    dual
  5  CONNECT BY
  6          level <= 1000000
  7  /

1000000 rows created.

Elapsed: 00:00:09.53
SQL>
SQL> prompt old fashioned "mask" check
old fashioned "mask" check
SQL>
SQL> drop table t_check
  2  /

Table dropped.

Elapsed: 00:00:00.59
SQL> CREATE TABLE t_check
  2      (value VARCHAR2(50) NOT NULL
  3          , CHECK(translate(lpad(value, 20, '0')
  4              , '1234567890', '0000000000') = '00000000000000000000' ))
  5  /

Table created.

Elapsed: 00:00:00.01
SQL>
SQL> INSERT
  2  INTO    t_check
  3  SELECT  level
  4  FROM    dual
  5  CONNECT BY
  6          level <= 1000000
  7  /

1000000 rows created.

Elapsed: 00:00:02.82
SQL>
SQL> prompt old fashioned "range" check
old fashioned "range" check
SQL>
SQL> drop table t_check
  2  /

Table dropped.

Elapsed: 00:00:00.39
SQL> CREATE TABLE t_check
  2      (value VARCHAR2(50) NOT NULL
  3          , CHECK( value between 1 and 1000000))
  4  /

Table created.

Elapsed: 00:00:00.01
SQL>
SQL> INSERT
  2  INTO    t_check
  3  SELECT  level
  4  FROM    dual
  5  CONNECT BY
  6          level <= 1000000
  7  /

1000000 rows created.

Elapsed: 00:00:02.23
SQL>
APC
"Finally, most applications are not going to be loading one million rows at a time..." Quite: imagine we load 1 millions rows per year, it clearly isn't worth risking data integrity to save 27 seconds spread over a year!
Tony Andrews
The applications that load a million records at a time are the ones that most need the check constraints because those imports are likely not coming from the application but from an ETL application instead. With no constraints it is very easy to load bad data as the ETL specialists are less likely to be aware of application business rules that should be applied.
HLGEM
A: 

My real question to you is: can data enter your database from a source other than the application? Will there be large imports, or queries run from other applications or from a query window? If you got a new client who wanted to import records from their previous vendor, how would that be handled? If you can conceive of any reason why the data might be changed from outside the application, and the rules must be enforced for all data, then you need check constraints or triggers to enforce them.

How important is the format to the way the rest of the application works? For instance, if you store all phone numbers as digits only and add the formatting on the page displaying the data, what will be the effect if someone adds a phone number of (111) 111-1111 to your displayed information?

Data integrity is a critical part of databases. If you can't rely on the data to be in the correct format or to follow the correct rules (only three correct values for a field, for instance), then your data becomes less valuable. Depending on the severity, it could be a critical failure.
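
To make the phone number case concrete, a minimal sketch (hypothetical table and column names) of a digits-only rule enforced in the database:

-- Hypothetical example: store phone numbers as bare digits and let
-- a CHECK reject formatted input such as '(111) 111-1111'.
CREATE TABLE contact (
    contact_id NUMBER(10)   NOT NULL,
    phone      VARCHAR2(20) NOT NULL
               CHECK (REGEXP_LIKE(phone, '^[0-9]{10}$'))
);

INSERT INTO contact VALUES (1, '1111111111');      -- succeeds
INSERT INTO contact VALUES (2, '(111) 111-1111');  -- fails: ORA-02290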

Another thing to consider: do you have people who are not admins but who can directly access the tables? If so, rules enforced at the database level can help reduce the possibility of fraud. (Developers often forget to protect the data from authorized users. The rules I'm talking about here are more likely enforced through triggers than simple check constraints, except possibly top limits on charges, but it is something to consider when designing the database.)
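
A minimal sketch of that last kind of rule, assuming a hypothetical charges table and a fixed top limit (a real fraud rule would more likely look the limit up per customer or role):

-- Hypothetical trigger enforcing a top limit on charges; unlike a
-- CHECK constraint, a trigger could also consult other tables here.
CREATE OR REPLACE TRIGGER trg_charge_limit
BEFORE INSERT OR UPDATE OF amount ON charges
FOR EACH ROW
BEGIN
    IF :NEW.amount > 10000 THEN
        RAISE_APPLICATION_ERROR(-20001,
            'Charge of ' || :NEW.amount || ' exceeds the allowed limit');
    END IF;
END;
/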

HLGEM
A: 

Trust is good, control is better.

Often you don't need regexes at all. See here for an example: http://stackoverflow.com/questions/1348804/regular-expressions-at-end-of-string/1351068#1351068. SUBSTR, TRANSLATE, INSTR... are powerful and fast.
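
For instance, a sketch of a suffix rule in the spirit of that linked answer, written with SUBSTR instead of a regex (the table and column names are hypothetical):

-- SUBSTR with a negative position reads from the end of the string,
-- so this checks a fixed suffix without invoking the regex engine.
CREATE TABLE t_file (
    fname VARCHAR2(100) NOT NULL,
    CHECK (SUBSTR(fname, -4) = '.csv')
);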

tuinstoel