OK, here is a simple abstraction of the problem:

Two variables (male_users and female_users) to store two groups of users, i.e. male and female.

  1. One way is to use two queries to select them:

select * from users where gender = 'male' and then store the result in male_users

select * from users where gender = 'female' and then store the result in female_users

  2. Another way is to run only one query:

select * from users and then loop over the result set, filtering the male and female users in the PHP code. The snippet would be something like this:

$result = mysql_query('select * from users');

while (($row = mysql_fetch_assoc($result)) !== false) {
    if ($row['gender'] == 'male') {
        $male_users[] = $row;    // add to male_users
    } elseif ($row['gender'] == 'female') {
        $female_users[] = $row;  // add to female_users
    }
}
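
and for comparison, option 1 with two queries would be something like this (same mysql_* calls, just a sketch):

$result = mysql_query("select * from users where gender = 'male'");
while (($row = mysql_fetch_assoc($result)) !== false) {
    $male_users[] = $row;    // the DB has already filtered for us
}

$result = mysql_query("select * from users where gender = 'female'");
while (($row = mysql_fetch_assoc($result)) !== false) {
    $female_users[] = $row;
}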

Which one is more efficient and considered the better approach?

This is just a simple illustration of the problem; the real project may have larger tables to query and more filter options.

thanks in advance!

+9  A: 

The rule of thumb for any application is to let the DB do the things it does well: filtering, sorting, and joining.

Separate the queries into their own functions or class methods:

$men = $foo->fetchMaleUsers();
$women = $foo->fetchFemaleUsers();
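
A minimal sketch of what such a class might look like, sticking with the same mysql_* API the question uses (the class name and details are illustrative only):

class UserRepository
{
    public function fetchMaleUsers()
    {
        return $this->fetchByGender('male');
    }

    public function fetchFemaleUsers()
    {
        return $this->fetchByGender('female');
    }

    private function fetchByGender($gender)
    {
        // Let the DB do the filtering; PHP just collects the rows.
        $sql = sprintf("SELECT * FROM users WHERE gender = '%s'",
                       mysql_real_escape_string($gender));
        $result = mysql_query($sql);
        $users = array();
        while (($row = mysql_fetch_assoc($result)) !== false) {
            $users[] = $row;
        }
        return $users;
    }
}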

Update

I took Steven's PostgreSQL demonstration of a full-table-scan query performing twice as well as two separate indexed queries, and mimicked it using MySQL (which is what the actual question uses):

Schema

CREATE TABLE `gender_test` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `gender` enum('male','female') NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=26017396 DEFAULT CHARSET=utf8

I changed the gender type from VARCHAR(20) to an ENUM, which is more realistic for this column, and I also provided a primary key, as you would expect on a real table, instead of an arbitrary DOUBLE value.
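
(If you want to reproduce this, repeated self-inserts are one way to fill the table; each pass doubles the row count, so a couple dozen passes from two seed rows reaches tens of millions. Exact counts will differ from mine.)

INSERT INTO gender_test (gender) VALUES ('male'), ('female');

-- Run repeatedly; each pass doubles the table with random genders.
INSERT INTO gender_test (gender)
SELECT IF(RAND() < 0.5, 'male', 'female') FROM gender_test;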

Unindexed Results

mysql> select sql_no_cache * from gender_test WHERE gender = 'male';

12995993 rows in set (31.72 sec)

mysql> select sql_no_cache * from gender_test WHERE gender = 'female';

13004007 rows in set (31.52 sec)

mysql> select sql_no_cache * from gender_test;

26000000 rows in set (32.95 sec)

I trust this needs no explanation.

Indexed Results

ALTER TABLE gender_test ADD INDEX (gender);

...

mysql> select sql_no_cache * from gender_test WHERE gender = 'male';

12995993 rows in set (15.97 sec)

mysql> select sql_no_cache * from gender_test WHERE gender = 'female';

13004007 rows in set (15.65 sec)

mysql> select sql_no_cache * from gender_test;

26000000 rows in set (27.80 sec)

The results shown here are radically different from Steven's data. Each indexed query performs almost twice as fast as the full table scan. This is from a properly indexed table using common-sense column definitions. I don't know PostgreSQL at all, but there must be some significant misconfiguration in Steven's example for it not to show similar results.

Given PostgreSQL's reputation for doing things better than MySQL, or at least as well, I daresay that PostgreSQL would demonstrate similar performance if properly used.

Also note that on this same machine, an overly simplified for loop doing 52 million comparisons takes an additional 7.3 seconds to execute.

<?php
// Roughly simulates the comparison work of filtering in PHP:
// an if/elseif per row over 26 million rows, i.e. up to 52 million comparisons.
$N = 52000000;
for ($i = 0; $i < $N; $i++) {
    if (true == true) {
        // intentionally empty: we only measure loop/comparison overhead
    }
}
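
(Timing it is just a matter of wrapping the loop in microtime() calls, e.g.:)

$start = microtime(true);
// ... the loop above ...
printf("%.1f seconds\n", microtime(true) - $start);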

I think it's rather obvious what is the better approach given this data.

hobodave
+1: Well said. "let the DB do the things it does well".
Guru
To make it clear to the questioner: although his app may work well for small result sets, once the data in the database grows he'll hit a massive scaling issue unless he lets the DB do, as you say, the 'things it does well'.
Robin
A: 

If you have 1 million users (considering half of them are male and half are female), do you prefer:

  • fetching 1 million users from the DB?
  • or fetching only 500k users from the DB?

I suppose you will answer that you prefer to fetch only half the users ;-) And, with a more complex condition, it could be even less than that.


Basically, fetching less data means:

  • less network used "for nothing" (i.e. to fetch data that will immediately be discarded)
  • less memory used, especially on the PHP server
  • potentially less disk access on the MySQL server -- as there is less data to fetch from disk

In the general case, we try to avoid fetching more data than necessary; i.e. we place the filtering on the database side.


Of course, this means you'll have to think about the indexes you place on your database's tables: they'll have to fit the needs of the queries you'll be doing.
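
For the users example from the question, that could be as simple as the following (the index name is arbitrary):

ALTER TABLE users ADD INDEX idx_users_gender (gender);

-- Both filtered queries can now be served via the index:
SELECT * FROM users WHERE gender = 'male';
SELECT * FROM users WHERE gender = 'female';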

Pascal MARTIN
Note that you're not fetching less data - his question explicitly asks about fetching both sets of data; whether to do it all in one go or in two fetches.
Steven Schlansker
That said, I totally agree that fetching less data is better in many ways ;-)
Steven Schlansker
-1 the questioner wants ALL the data - his question doesn't ask what would be faster if he only wanted the men
Elemental
+4  A: 

I'd argue that there's really no reason to make your DB do the extra work of evaluating the WHERE clause. Given that you actually want all the records, you will have to do the work of fetching them. If you do a single SELECT from the table, it will retrieve them all in table-order and you can partition them yourself. If you SELECT WHERE male and SELECT WHERE female, you'll have to hit an index for each operation, and you'll lose some data locality.

For example, if your records on disk are alternating male-female and you have a dataset much larger than memory, you'll likely have to read the entire database twice if you do two separate queries, whereas a single SELECT for both will be a single table scan.

EDIT: Since I'm getting downmodded into oblivion, I decided to actually run the test. I've generated a table

CREATE TEMPORARY TABLE gender_test (some_data DOUBLE PRECISION, gender CHARACTER VARYING(20));

I generated some random data,

select gender, count(*) from gender_test group by gender;

 gender |  count
--------+----------
 female | 12603133
 male   | 10465539
(2 rows)
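
(Something along these lines will generate comparable data; the exact proportions will differ from run to run:)

INSERT INTO gender_test (some_data, gender)
SELECT random(),
       CASE WHEN random() < 0.55 THEN 'female' ELSE 'male' END
FROM generate_series(1, 23068672);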

First, let's run these tests without indices, in which case I'm quite sure I'm right...

test=> EXPLAIN ANALYSE SELECT * FROM gender_test WHERE gender='male';
                           QUERY PLAN
----------------------------------------------------------------------
 Seq Scan on gender_test  (cost=0.00..468402.00 rows=96519 width=66) (actual time=0.030..4595.367 rows=10465539 loops=1)
   Filter: ((gender)::text = 'male'::text)
 Total runtime: 5150.263 ms

test=> EXPLAIN ANALYSE SELECT * FROM gender_test WHERE gender='female';
                           QUERY PLAN
----------------------------------------------------------------------
 Seq Scan on gender_test  (cost=0.00..468402.00 rows=96519 width=66) (actual time=0.029..4751.219 rows=12603133 loops=1)
   Filter: ((gender)::text = 'female'::text)
 Total runtime: 5418.891 ms

test=> EXPLAIN ANALYSE SELECT * FROM gender_test;
                           QUERY PLAN
----------------------------------------------------------------------
 Seq Scan on gender_test  (cost=0.00..420142.40 rows=19303840 width=66) (actual time=0.021..3326.164 rows=23068672 loops=1)
 Total runtime: 4543.393 ms

Funny, looks like fetching the data in a table scan without the filter is indeed faster! In fact, more than twice as fast! (5150 + 5418 = 10568 ms vs 4543 ms.) Much like I predicted! :-p

Now, let's make an index and see if it changes the results...

CREATE INDEX test_index ON gender_test(gender);

Now to rerun the same queries...

test=> EXPLAIN ANALYSE SELECT * FROM gender_test WHERE gender='male';
                           QUERY PLAN
----------------------------------------------------------------------
 Bitmap Heap Scan on gender_test  (cost=2164.69..195922.27 rows=115343 width=66) (actual time=2008.877..4388.348 rows=10465539 loops=1)
   Recheck Cond: ((gender)::text = 'male'::text)
   ->  Bitmap Index Scan on test_index  (cost=0.00..2135.85 rows=115343 width=0) (actual time=2006.047..2006.047 rows=10465539 loops=1)
         Index Cond: ((gender)::text = 'male'::text)
 Total runtime: 4941.64 ms

test=> EXPLAIN ANALYSE SELECT * FROM gender_test WHERE gender='female';
                           QUERY PLAN
----------------------------------------------------------------------
 Bitmap Heap Scan on gender_test  (cost=2164.69..195922.27 rows=115343 width=66) (actual time=1915.385..4269.933 rows=12603133 loops=1)
   Recheck Cond: ((gender)::text = 'female'::text)
   ->  Bitmap Index Scan on test_index  (cost=0.00..2135.85 rows=115343 width=0) (actual time=1912.587..1912.587 rows=12603133 loops=1)
         Index Cond: ((gender)::text = 'female'::text)
 Total runtime: 4931.555 ms

test=> EXPLAIN ANALYSE SELECT * FROM gender_test;
                           QUERY PLAN
----------------------------------------------------------------------
 Seq Scan on gender_test  (cost=0.00..457790.72 rows=23068672 width=66) (actual time=0.021..3304.836 rows=23068672 loops=1)
 Total runtime: 4523.754 ms

Funny... scanning the entire table in one go is still more than twice as fast! (4941 + 4931 = 9872 ms vs 4523 ms.)

NOTE: There are all sorts of ways in which this is unscientific. I'm running with 16GB of RAM, so the entire dataset fits into memory. Postgres isn't configured to use nearly that much, but the disk cache still helps... I'd hypothesize (but can't be assed to actually try) that the effects only get worse once you hit disk. I only tried the default btree Postgres index type. I'm assuming the PHP partitioning takes no time - not true, but probably a pretty reasonable approximation.

All tests were run on a Mac Pro: 8-way 2.66 GHz Xeon, 16 GB RAM, RAID-0 of 7200 rpm disks.

Also, this dataset is 26 million rows, which is probably a bit larger than most people care about...

Obviously, raw speed isn't the only thing you care about. In many (most?) applications, you'd care more about the logical "correctness" of fetching them separately. But when it comes down to your boss saying "we need this to go faster", this will (apparently) give you a 2x speedup. The OP explicitly asked about efficiency. Happy?

Steven Schlansker
That's way out of best practices.
Nirmal
Why? What, specifically, do you disagree with?
Steven Schlansker
Your answer demonstrates a _fundamental_ misunderstanding of how RDBMSs work, as well as of development best practices. To readers of this answer: treat it as _what not to do_.
hobodave
You still haven't said anything specific.
Steven Schlansker
Please reread my edited answer.
Steven Schlansker
Your answer only considers efficiency from the perspective of the DB. Looping through the entire result set multiple times in the code is much more work than doing multiple full table scans on the DB. The DB is designed to be more efficient for this sort of activity.
ecounysis
You're focusing on an exceptional case. You're ignoring the OP's confession: "this is just a simple illustration of the problem ... the real project [will have more] tables to query and more filter options". The question being asked is whether you filter your resultsets using WHERE or app code, and the answer is WHERE. It's _clear_ to me that the example provided by the OP is just that, an example.
hobodave
Your modified answer, again, focuses on the exceptional. The OP confesses this is a "simple example". In production applications you would not be using `SELECT *` but rather `SELECT ...`, and you would have covering indexes when needed and possible. Querying data from a covering index using InnoDB will not "scan the entire database twice".
hobodave
You state: "For example, if your records on disk are alternating male-female and you have a dataset much larger than memory". So, his dataset is too large to fit into memory, yet it magically fits into memory when the entire table is read into his application code? This makes no sense.
hobodave
@Eric - you're not looping multiple times over the data set in the frontend; you're only going over it once to determine whether it's male or female. Basically, you've got a step where you take the SQL row and copy it into your list. Instead, you choose one of two lists. Still only goes over the list once...
Steven Schlansker
@hobodave - With a small enough dataset, anything is okay to do. You can cartesian-product fifteen tables if you want, and it won't matter. Asking performance questions is usually only interesting on larger datasets... As for fitting into memory - it's entirely possible that you either don't allocate all your memory to the DB caches or that you're streaming results through. I'm not claiming that you should always consider things like this, just that if you're going to ask a performance question it's worth considering what happens with a large dataset. Much like O(n) notation...
Steven Schlansker
I'm somewhat amused that you think that considering read and write load is essential with trivial datasets (say 0-5 rows, for example) but not when you've got 25million rows as I did in my testing...
Steven Schlansker
"I'm assuming the PHP partitioning takes no time - not true, but probably a pretty reasonable approximation." - On _what planet_ does looping over 26 million items and making 52 million comparisons using a _scripting language_ take approximately "no time".
hobodave
Because you're going to have to loop over them anyway, to load them into whatever datastructure you're using? The example code already has a while loop to copy the results out of the SQL result set. Adding a decision as to which list to put it in is probably inconsequential compared to whatever processing you're doing on it later.
Steven Schlansker
"I'm somewhat amused..." - I'm not going over _all_ of the dozens of inaccuracies and outright fallacies you are presenting, just some. You're making gross generalizations left and right that defy logic. "anything is ok" is never true.
hobodave
Of course it's not. I'm being snarky, because you're throwing out useless and unsubstantiated claims. Fundamental misunderstanding of how RDBMSs work? So much fail in my answer? The data shows that I'm right in at least some interesting cases, which means it's worth noting and discussing, and chucking such empty ad hominem attacks doesn't really accomplish anything.
Steven Schlansker
On a server of equivalent specs as yours 52 million comparisons takes 10.8 seconds. So, swap 4.5 seconds of DB time for 10.8 seconds of script time? Yea, that's trivial.
hobodave
Interesting. On mine, it's going at about 2.5 seconds for that many comparisons, making it still a win. Obviously if you're dealing with that many records though you'll probably consider switching to a faster language, perhaps Java, which will make it less of an issue...
Steven Schlansker
Just goes to show it's always worth doing the numbers out if you actually care about how fast things run.
Steven Schlansker
What would you do in the future if, say, they added a `Prefer not to say` gender to the options, but you still wanted to retrieve only the male and female rows? Go back to the application and shift the business-logic handling back to the database? Speed advantages or future-proof design: the developer has to decide.
Nirmal
I duplicated your tests using MySQL with InnoDB. The results are significantly different, and do not support your stance. See my answer.
hobodave
@Nirmal: Either way, you're going to have to modify something. The decision to future-proof your code isn't something that I'd care to address at the moment.
Steven Schlansker
@hobodave - I don't have a MySQL install, and don't know its internals very well. I can't readily explain why it performs so differently from Postgres. Interesting results, though.
Steven Schlansker
Modifying something is entirely different from modifying everything. I would prefer to add another gender in the database and move on with new development, instead of going back to an old page and reinventing methods. You have to care about all the problems. Speaking just about DB and network performance is something you do when you write a book.
Nirmal