views:

883

answers:

7

Hi

In our product we have a generic search engine, and we are trying to optimize its search performance. A lot of the tables used in the queries allow null values. Should we redesign our tables to disallow null values for optimization, or not?

Our product runs on both Oracle and MS SQL Server.

+2  A: 

Short answer: yes, conditionally!

The main issue with null values and performance has to do with forward lookups.

If you insert a row into a table with null values, it's placed in the natural page that it belongs to. Any query looking for that record will find it in the appropriate place. Easy so far...

...but let's say the page fills up, and now that row is cuddled in amongst the other rows. Still going well...

...until the row is updated, and the null value now contains something. The row's size has increased beyond the space available to it, so the DB engine has to do something about it.

The fastest thing for the server to do is to move the row off that page into another, and to replace the row's entry with a forward pointer. Unfortunately, this requires an extra lookup when a query is performed: one to find the natural location of the row, and one to find its current location.

So, the short answer to your question is yes, making those fields non-nullable will help search performance. This is especially true if the null fields in the records you search on are often updated to non-null values.

Of course, there are other penalties (notably I/O, although to a tiny extent index depth) associated with larger datasets, and then you have application issues with disallowing nulls in fields that conceptually require them, but hey, that's another problem :)

Jeremy Smyth
Setting those columns NOT NULL won't solve the "row migration" problem: if the information is not known at insert time, some other default value will be entered (like '.') and you will still have rows migrated when real data replaces the default value. In Oracle you would set PCTFREE appropriately to prevent row migration.
Vincent Malgrat
+2  A: 

In Oracle, NULL values are not indexed, i.e. this query:

SELECT  *
FROM    table
WHERE   column IS NULL

will always use a full table scan, since the index doesn't cover the values you need.

More than that, this query:

SELECT  column
FROM    table
ORDER BY
        column

will also use a full table scan and a sort, for the same reason.

If your values don't intrinsically allow NULLs, then mark the column as NOT NULL.

Quassnoi
How will the same queries affect MS SQL Server?
Jakob Ojvind Nielsen
SQL Server does index NULLs
Quassnoi
You can get around this limitation with a function-based index in which you include a literal value, such as CREATE INDEX MY_INDEX ON MY_TABLE (MY_NULLABLE_COLUMN, 0)
David Aldridge
+3  A: 

If your column doesn't contain NULLs, it is best to declare the column NOT NULL; the optimizer may be able to take a more efficient path.

However, if you have NULLs in your column you don't have much choice (a non-null default value may create more problems than it solves).

As Quassnoi mentioned, NULLs are not indexed in Oracle, or to be more precise, a row won't be indexed if all the indexed columns are NULL, which means:

  • that NULLs can potentially speed up your searches, because the index will have fewer rows
  • that you can still index the NULL rows if you add another NOT NULL column, or even a constant, to the index.

The following script demonstrates a way to index NULL values:

CREATE TABLE TEST AS 
SELECT CASE
          WHEN MOD(ROWNUM, 100) != 0 THEN
           object_id
          ELSE
           NULL
       END object_id
  FROM all_objects;

CREATE INDEX idx_null ON test(object_id, 1);

SET AUTOTRACE ON EXPLAIN

SELECT COUNT(*) FROM TEST WHERE object_id IS NULL;
Vincent Malgrat
A: 

In my experience NULL is a valid value and usually means "don't know". If you don't know, then it really is pointless to make up some default value for the column or to try to enforce a NOT NULL constraint. NULL just happens to be a specific case.

The real challenge with NULLs is that they complicate retrieval a bit. For instance, you cannot say WHERE column_name IN (NULL, 'value1', 'value2') and expect it to match the NULL rows.
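That IN-list trap follows from standard NULL comparison semantics, so it can be sketched with any compliant engine. Here is a minimal illustration using SQLite through Python (the table and data are made up; Oracle and SQL Server behave the same way for these predicates):

```python
import sqlite3

# Hypothetical table and data, just to show IN-list semantics with NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (column_name TEXT)")
conn.executemany("INSERT INTO t VALUES (?)",
                 [("value1",), (None,), ("value2",)])

# The NULL inside the IN-list matches nothing: NULL never compares equal.
without_is_null = conn.execute(
    "SELECT column_name FROM t "
    "WHERE column_name IN (NULL, 'value1', 'value2')").fetchall()
print(without_is_null)  # [('value1',), ('value2',)] -- NULL row is missing

# An explicit IS NULL predicate is required to pick up the NULL row.
with_is_null = conn.execute(
    "SELECT column_name FROM t "
    "WHERE column_name IN ('value1', 'value2') OR column_name IS NULL"
).fetchall()
print(with_is_null)  # all three rows
```

The same `OR column_name IS NULL` rewrite is what you end up scattering through a generic search engine once nullable columns are searchable.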

Personally, if you find that a lot of your columns, or certain columns, contain a lot of NULLs, I think you might want to revisit your data model. Maybe those null columns can be moved into a child table? For example: a contact table with columns for name, home phone, cell phone, fax number, work number, emergency number, etc. You may only populate one or two of those, and it would be better to normalize them out.
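One possible shape for that child table, sketched in SQLite through Python (all table and column names here are hypothetical):

```python
import sqlite3

# A hypothetical normalization of the wide, mostly-NULL phone columns:
# each existing number becomes a row in a child table, and absent numbers
# are simply absent rows -- no NULL placeholders needed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE phone_number (
    person_id  INTEGER NOT NULL REFERENCES person(person_id),
    phone_type TEXT    NOT NULL,  -- 'home', 'cell', 'fax', 'work', ...
    number     TEXT    NOT NULL,  -- can be NOT NULL: missing means no row
    PRIMARY KEY (person_id, phone_type)
);
""")
conn.execute("INSERT INTO person VALUES (1, 'Alice')")
# Alice has only a cell phone; nothing is stored for the other types.
conn.execute("INSERT INTO phone_number VALUES (1, 'cell', '555-0100')")

rows = conn.execute("""
    SELECT p.name, pn.phone_type, pn.number
      FROM person p
      JOIN phone_number pn ON pn.person_id = p.person_id
""").fetchall()
print(rows)  # [('Alice', 'cell', '555-0100')]
```

The NOT NULL constraints become natural here because a row only exists when the value does.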

What you need to do is step back and see how the data will be accessed. Is this a column that should have a value? Is this a column that only has a value for certain cases? Is this a column that will be queried a lot?

David
I only use null to express a nonexistent foreign key (e.g., a "Discount Coupon" foreign key on an invoice item table might not exist). However, I don't use nulls in non-foreign-key columns; as you say, it "usually" means don't know. The problem with nulls is that they can mean several things - "unknown", "not applicable", "does not exist" (my case), etc. In non-key cases, you will always have to map a name to the NULL field when you finally get around to using it. Better to have that mapping value defined in the column itself as a real value rather than duplicating the mapping everywhere.
Steve Broberg
+1  A: 

The issue of whether to use Nulls because they affect performance is one of those balancing acts of database design. You have to balance business needs against performance.

Nulls should be used if they are needed. For instance, you may have a begin date and an end date in a table. You often would not know the end date at the time the record is created, so you must allow nulls whether they affect performance or not; the data simply isn't there to be put in. However, if the business rules say the data must be there at the time the record is created, then you should not allow nulls. This improves performance, makes coding a bit simpler, and makes sure data integrity is preserved.

If you have existing data that you would like to change to no longer allow nulls, then you have to consider the impact of that change. First, do you know what value to put into the records which are currently null? Second, do you have a lot of code using ISNULL or COALESCE that you would need to update? (These checks slow performance, so if you no longer need them, you should change the code.) Do you need a default value, and can you really assign one? If not, will some of the insert or update code break once the field can no longer be null?

Sometimes people will put in bad information just to get rid of nulls. If the price field now needs to contain both decimal values and things like 'unknown', it can't properly be a decimal datatype, and you have to go to all sorts of lengths to do calculations. This often creates performance problems as bad as or worse than the null did. Plus, everywhere your code referenced the field being null or not null, you need to rewrite it to include or exclude rows based on whatever sentinel values people put in because the data was not allowed to be null.
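As a small illustration of the ISNULL/COALESCE point (SQLite through Python, with made-up data): aggregates already skip NULLs, and COALESCE can supply a display default at query time, so storing a sentinel value in the column is rarely necessary:

```python
import sqlite3

# Made-up price data with one genuinely unknown (NULL) price.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (name TEXT, price REAL)")
conn.executemany("INSERT INTO item VALUES (?, ?)",
                 [("a", 10.0), ("b", None), ("c", 20.0)])

# Aggregates ignore NULLs: the average is taken over known prices only.
avg_price, priced_items = conn.execute(
    "SELECT AVG(price), COUNT(price) FROM item").fetchone()
print(avg_price, priced_items)  # 15.0 2

# COALESCE substitutes a default for display without storing a sentinel.
rows = conn.execute(
    "SELECT name, COALESCE(price, 0) FROM item ORDER BY name").fetchall()
print(rows)
```

Had 'unknown' been stored in the price column instead of NULL, the AVG above would have required cleanup code first.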

I do a lot of data imports from client data, and every time we get a file where some field that should allow nulls does not, we get garbage data that needs to be cleaned up before we import it into our system. Email is one of these: the value is often not known at input time, and since it's generally some type of string field, the user can type anything in. We go to import emails and find things like "I don't know". It's tough to actually send an email to "I don't know". If the system requires a valid email address and checks for something like the existence of an @ sign, we get '[email protected]' instead. How is garbage data like this useful to the users of the data?

Some of the performance issues with nulls are a result of writing non-sargable queries. Sometimes just rearranging the WHERE clause, rather than eliminating a necessary null, can improve the performance.
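A sketch of what "rearranging the WHERE clause" can mean (SQLite through Python; the table and column names are hypothetical): wrapping the column in ISNULL/COALESCE hides it from the optimizer, while the logically equivalent OR form leaves the bare column available to an index:

```python
import sqlite3

# Hypothetical data: orders with a nullable discount column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (discount REAL)")
conn.executemany("INSERT INTO orders VALUES (?)",
                 [(0.0,), (None,), (0.1,), (0.0,)])
conn.execute("CREATE INDEX idx_discount ON orders(discount)")

# Non-sargable: the function call wrapped around the column generally
# prevents the engine from seeking into idx_discount.
non_sargable = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE COALESCE(discount, 0) = 0"
).fetchone()[0]

# Sargable rewrite: same rows, but the bare column can use the index.
sargable = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE discount = 0 OR discount IS NULL"
).fetchone()[0]

print(non_sargable, sargable)  # logically equivalent predicates
```

Whether the OR form actually seeks an index varies by engine and statistics, but it at least gives the optimizer the choice; the COALESCE form takes it away.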

HLGEM
A: 

Nullable fields can have a big impact on performance when doing "NOT IN" queries. Because rows with all indexed fields set to null aren't stored in a B-tree index, Oracle must do a full table scan to check for null entries, even when an index exists.

For example:

create table t1 as select rownum rn from all_objects;

create table t2 as select rownum rn from all_objects;

create unique index t1_idx on t1(rn);

create unique index t2_idx on t2(rn);

delete from t2 where rn = 3;

explain plan for
select *
  from t1
 where rn not in ( select rn
                     from t2 );

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      | 50173 |   636K|  3162   (1)| 00:00:38 |
|*  1 |  FILTER            |      |       |       |            |          |
|   2 |   TABLE ACCESS FULL| T1   | 50205 |   637K|    24   (5)| 00:00:01 |
|*  3 |   TABLE ACCESS FULL| T2   | 45404 |   576K|     2   (0)| 00:00:01 |
---------------------------------------------------------------------------

The query has to check for null values so it has to do a full table scan of t2 for each row in t1.
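The NULL check is not just a plan detail; it is forced by SQL semantics. A NOT IN whose subquery returns even one NULL can never be true, as this small sketch shows (SQLite through Python, made-up data), along with the usual NOT EXISTS rewrite that sidesteps the problem:

```python
import sqlite3

# Made-up data: t1 = {1, 2, 3}, t2 = {1, 2, NULL}.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (rn INTEGER)")
conn.execute("CREATE TABLE t2 (rn INTEGER)")
conn.executemany("INSERT INTO t1 VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO t2 VALUES (?)", [(1,), (2,), (None,)])

# For rn = 3 the test "3 <> NULL" evaluates to NULL, so the whole
# NOT IN predicate is unknown and the row is filtered out.
not_in_rows = conn.execute(
    "SELECT rn FROM t1 WHERE rn NOT IN (SELECT rn FROM t2)").fetchall()
print(not_in_rows)  # [] -- no rows, even though 3 is missing from t2

# NOT EXISTS ignores the NULL and behaves as most people expect.
not_exists_rows = conn.execute(
    "SELECT rn FROM t1 WHERE NOT EXISTS "
    "(SELECT 1 FROM t2 WHERE t2.rn = t1.rn)").fetchall()
print(not_exists_rows)  # [(3,)]
```

This is why the engine cannot simply probe the index in the nullable case: it has to prove no NULL exists in the subquery before any row may qualify.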

Now, if we make the fields not nullable, it can use the index.

alter table t1 modify rn not null;

alter table t2 modify rn not null;

explain plan for
select *
  from t1
 where rn not in ( select rn
                     from t2 );

-----------------------------------------------------------------------------
| Id  | Operation          | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |        |  2412 | 62712 |    24   (9)| 00:00:01 |
|   1 |  NESTED LOOPS ANTI |        |  2412 | 62712 |    24   (9)| 00:00:01 |
|   2 |   INDEX FULL SCAN  | T1_IDX | 50205 |   637K|    21   (0)| 00:00:01 |
|*  3 |   INDEX UNIQUE SCAN| T2_IDX | 45498 |   577K|     1   (0)| 00:00:01 |
-----------------------------------------------------------------------------
Daniel Emge
+1  A: 

A separate answer to draw some extra attention to David Aldridge's comment on Quassnoi's accepted answer.

The statement:

this query:

SELECT * FROM table WHERE column IS NULL

will always use full table scan

is not true. Here is a counterexample using an index that includes a literal value:

SQL> create table mytable (mycolumn)
  2  as
  3   select nullif(level,10000)
  4     from dual
  5  connect by level <= 10000
  6  /

Table created.

SQL> create index i1 on mytable(mycolumn,1)
  2  /

Index created.

SQL> exec dbms_stats.gather_table_stats(user,'mytable',cascade=>true)

PL/SQL procedure successfully completed.

SQL> set serveroutput off
SQL> select /*+ gather_plan_statistics */ *
  2    from mytable
  3   where mycolumn is null
  4  /

  MYCOLUMN
----------


1 row selected.

SQL> select * from table(dbms_xplan.display_cursor(null,null,'allstats last'))
  2  /

PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------
SQL_ID  daxdqjwaww1gr, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ *   from mytable  where mycolumn
is null

Plan hash value: 1816312439

-----------------------------------------------------------------------------------
| Id  | Operation        | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT |      |      1 |        |      1 |00:00:00.01 |       2 |
|*  1 |  INDEX RANGE SCAN| I1   |      1 |      1 |      1 |00:00:00.01 |       2 |
-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("MYCOLUMN" IS NULL)


19 rows selected.

As you can see, the index is being used.

Regards, Rob.

Rob van Wijk