ansaurus

Question

SQL design approach for searching a table with an unlimited number of bit fields

Answer 1

+2 A:

I suggest you go with the second approach, known as Entity-attribute-value model. It's probably the only approach that will scale as you need.

You could also have two searches, the basic and the advanced. You keep the attributes for the basic search in one table, and all the advanced attributes in the other table. This way at least the basic search will remain rapid as the number of attributes will grow with time.

Developer Art 2009-10-01 15:56:22

@Tom: Otherwise, you'll be back to ask us questions like this: http://stackoverflow.com/questions/1494535/parsing-a-dash-delimited-value-in-a-sql-table-and-doing-lookups

OMG Ponies 2009-10-01 16:01:14

I just did some quick research on EAV and it has a fairly bad reputation from what I found over the past 20 minutes.. The queries are brutal, 20,30 inner joins....

Tom DeMille 2009-10-01 16:22:26

@Tom, the joins on those queries are typically due to a misunderstanding of how to write the sql.

Chris Lively 2009-10-01 16:35:38

so how would one write a query to select properties where hasparking = true, hasdeck = true, allowspets = false using EAV approach?

Tom DeMille 2009-10-01 16:39:39

@Tom, to clarify, when using an EAV model your search queries should dynamically create pivot tables to run regular select's against. This eliminates all of the joins. If necessary, you can store the dynamically created pivot table code as a view. When a new attribute is added, just invalidate and recreate the view.

Chris Lively 2009-10-01 16:40:34

this seems really hopeful, does anyone have any links to articles on creating pivot tables from EAV, I have been searching and can't find anything particularly helpful

Tom DeMille 2009-10-01 16:58:47

Tom - I give an example of how to handle queries in the nested EAV approach below. PIVOTs are one way around this but keep in mind that PIVOTs are pretty expensive as well compared with normal queries. As I mention, I still use them in certain circumstances but, having been there, I use approach #1.

Mark Brittingham 2009-10-01 17:51:39

Also, I strongly disagree with the assertion that an EAV model is the only one that will "scale as you need." This makes no sense at all to me given that subqueries, refining queries and pivots are all much more expensive than simple record searches. BTW - if you doubt my experience with PIVOTS, please see http://stackoverflow.com/questions/651936/how-can-i-improve-this-mailing-address-sql-server-select-statement/652065#652065

Mark Brittingham 2009-10-01 17:53:25

Mark = seems you are right with the speed, see my comments below, dynamic sql with EXEC is a lot faster (with all the rows in one table)

Tom DeMille 2009-10-02 20:32:17

Answer 2

A:

Create a table that stores attributes or search columns based on the apartment. Definitely do not keep adding more bit field columns..nightmare maintenance and nightmare coding. And definitely don't please don't dynamically generate where statements and use exec.

JonH 2009-10-01 15:58:05

Answer 3

A:

I've never tested this, but what if you were to create a varchar(256) fields that stored all of your flags as one long string of 0's and 1's.

For example,

AllowsPets = 1
HasParking = 0
HasDeck = 1
ModernKitchen = 1

would be:

PropertyFlags = 1011

and if you were looking for something that AllowsPets and HasDeck, then the search query would look something like this:

WHERE PropertyFlags LIKE '1_1_' (the underscore represents a single wildcard character in the like clause)

this would solve your issues with adding additional columns to the search in the future, but I'm not sure how this would do performance-wise.

has anyone out there tried anything similar to this?

2009-10-01 16:32:57

The problem here is that it's pretty easy to lose the relationship between the property flag and the positioning of the value. Also, it makes extending it for each client a nightmare.

Chris Lively 2009-10-01 16:38:25

Answer 4

A:

I've walked down this path a few times trying to store health status markers!

When I first started (in 2000?) I tried a character position approach (your #2) and found that it quickly became pretty unwieldy as I wrestled with the same questions over and over: "which position held 'Allows Pets' again?" or, worse yet, "how long is this string now? / which position am I on?" Can you work around this problem - developing objects to manage things for you? Well, yes, to an extent. But I really didn't appreciate how much extra work it cost compared to having the field identities managed for me by the database.

The second time around, I used an attribute/value pair approach similar to your solution #3. This basically worked and, for specialty needs, I still generate attribute/value pairs using a PIVOT. Also, my background is in AI and we used attribute/value pairs all the time in mechanical theorem proving so this was very natural for me.

However, there is a huge problem with this approach: pulling any one fact out ("Show me the apartments that allow pets") is easy but pulling all of the records meeting multiple constraints quickly gets very, very ugly (see my example below).

SO...I ended up adding fields to a table. I understand the theoretical reasons that Jon and 'Unknown' and 'New In Town' give for preferring other approaches and I'd have agreed with either or both at one point. But experience is a pretty harsh teacher...

A Couple More Things

First, I disagree that adding more bit fields is a nightmare of maintenance - at least compared with a character-bit approach (your #2). That is, having a distinct field for each attribute ensures that there is no 'management' necessary to figure out which slot belongs to which attribute.

Second, having 300 fields isn't really the problem - any decent database can do that without problem.

Third, your real issue and the source of pain is really the matter of dynamically generating your queries. If you are like me, this question is really all about "Do I really have to have this massive, grody and inelegant chain of "IF" statements to construct a query?"

The answer, unfortunately, is Yes. All three of the approaches you suggest will still boil down to a chain of IF statements.

In a database bit-field approach, you'll end up with a series of IF statements where all of your columns have to be added like so:

string SQL = "Select X,Y,Z Where ";

if (AllowsPets == 0)
  SQL += "(AllowsPets = 0) AND ";
else if (AllowsPets == )
  SQL += "(AllowsPets = 1) AND ";  // Else AllowsPets not in query
.
.
.
SQL = SQL.Substring(SQL.Length - 4);  // Get rid of trailing 'AND' / alternatively append '(1=1)'

In a character-position approach, you'll do the same thing but your "Appends" will add "0", "1" or "_" to your SQL. You'll also, of course, run into the maintenance issues deciding which one is which that I discussed above (enums help but don't completely solve the problem).

As mentioned above, the Attribute-Value approach is actually the worst. You'll have to either create a nasty chain of sub-queries (which surely will cause a stack overflow of some sort with 300 clauses) or you need to have an IF-THEN like this:

// Kill any previously stored selections.
SQLObject.Execute("Delete From SelectedApts Where SessionKey=X");
// Start with your first *known* attr/value and fill a table with the results.
.
.
Logic to pick first known attr/value pair
.
.
SQLObject.Execute("Insert Into SelectedApts Select X as SessionKey, AptID From AttrValue Where AllowsPets=1");

// Now you have the widest set that meets your criteria. Time to whittle it down.
if (HasParking == 1)
  SQLObject.Execute("Delete From SelectedApts Where AptID not in (Select AptID From AttrValue Where AllowsChildren=1));
if (AllowsChildren == 0)
  SQLObject.Execute("Delete From SelectedApts Where AptID not in (Select AptID From AttrValue Where AllowsChildren=0));
.
.
.
// Perform 2-300 more queries to keep whittling down your set to the actual match.

Now, you may be able to optimize this a bit so you run fewer queries (a PIVOT, sets of subqueries or using the UNION operator) but the fact is that this gets VERY expensive compared to the single query that you can use (but have to build) using the other approaches.

Thus, this is a painful kind of problem no matter what approach you take - there really is no magic that helps you to avoid it. But, having been there before, I would absolutely recommend approach #1.

Update: If you are really focused on pulling straight criteria matches ("All Apartments That Have A, B and C") and don't need other queries (like "...Sum(AllowsPets), Sum(AllowsChildren)..." or "...(AllowsPets=1) OR (AllowsChildren=1)...") then I really like KM's answer the more I look at it. It is very clever and looks likely to be acceptably fast.

Mark Brittingham 2009-10-01 17:44:58

thanks for your lengthy, helpful and well informed answer Mark. I have been following the eav/pivot trail all afternoon and the more I look into this the more I start to feel like adding 100 specific bit fields and 100 more if statements in my one query (we only have one search query) is the way to go

Tom DeMille 2009-10-01 18:13:29

lol - boy, I just wish someone had told me that back in 2000! As I say, though, this is a kind of ugly problem. And whenever good developers are faced with an ugly problem, they cannot help but think that there has to be a better way. So your search is quite understandable. At this point, most of the internals are hidden away in various classes that I've written but the pain of going through it stays with me!

Mark Brittingham 2009-10-01 18:49:54

@Mark Brittingham, what do you think about my answer? I'm certainly no fan of EAV table design, not in the least. I just thought I'd give it a shot.

KM 2009-10-01 19:05:02

Tom - see KM's answer though. It is pretty ingenious and gets around my objections to an attribute/value pair. I still cannot use it as I need to report statistics across record sets and rather than finding matching sets to a specific criteria set. However, it may work for you. Of course, my advice above may still hold - you might still find it simplest to create a table with many fields. At least SO has given you a lot to think about!

Mark Brittingham 2009-10-01 19:08:55

KM - see my comments in your answer. I just wish that I'd come up with it! As I wrote Tom, though, I'm not specifically after "Objects that have Attribute X" in my project. So...I'll cut myself just enough slack to say that I was not tuned in to that specific application. Still, I admire what you've done from a purely technical perspective.

Mark Brittingham 2009-10-01 19:11:19

@Mark, I think you ignore the potential in using views to access this data in combination with a pivot table. Also, you don't need a chain of IF statements. You can certainly go that route, but it's not necessary.

Chris Lively 2009-10-01 20:34:08

Chris, are you saying that if you lay the table out with a pivot you don't need a chain of IF statements to query it? How would you do this, then, if you want to leave out unspecified criteria? That is, if you really only want to search on three criteria, I'd think that you'd dynamically build the query (using a chain of IF statements in C#, for example, or some TSQL) rather than arbitrarily creating a long query with many attributes that don't apply (e.g. Where...((@Param="NA") or (AllowsChildren=@Param))...). I may not quite understand why you think the Pivot takes care of the IF?

Mark Brittingham 2009-10-01 20:50:53

Chris, still thinking about your question...if you are saying that you don't need the If statement I show above under the discussion of an EAV, you are right (you don't need to whittle the set down by hand - the Pivot/View takes care of creating a record-oriented view). However, you are THEN back at the need to query the table. THAT is where you'll then need the IF - to create the query on the view.

Mark Brittingham 2009-10-01 21:05:03

@Mark, the "Where...((@Param is null) or (AllowsChildren=@Param))...)" is actually quite effective and easy to append to without needing IF statements. Further, it can be put into a stored proc if you so desire. Also, I'm actually thinking CROSS TAB, not pivot (my mistake in previous comment). The pivot construct is too dense and under performing compared to cross tab. The cross tab view would be generated at the time the attribute list changed. Which would allow a search screen to be dynamically generated based on the client settings.

Chris Lively 2009-10-01 21:25:56

Chris...yes, I like the fact that you can test the parameter to take it out of the query although I can't help but think that it'd be more efficient not to have it. Since it doesn't enter into the actual search maybe it is no big deal - sounds like some performance testing time to me!

Mark Brittingham 2009-10-01 22:29:18

Chris (continued) Anyway, I have to admit to being unfamiliar with Cross Tabs in SQL Server. Do you have an example? Maybe enter a new answer on this thread to illustrate how it would apply. I don't really care about votes one way or another (hell, I'll vote for your answer even if it doesn't really apply just to see you illustrate the idea). I just enjoy the back and forth on SQL.

Mark Brittingham 2009-10-01 22:30:24

so as to being acceptably fast, it really isn't, we can't go backwards in speed.Anyone know if using CLR .dll based queries is faster than a normal sp?

Tom DeMille 2009-10-02 20:58:05

Answer 5

+8 A:

something like this may work for you:

define tables:

CREATE TABLE #Apartments
(
     ApartmentID    int          not null primary key identity(1,1)
    ,ApartmentName  varchar(500) not null
    ,Status         char(1)      not null default ('A') 
    --....
)

CREATE TABLE #AttributeTypes
(
    AttributeType         smallint     not null primary key
    ,AttributeDescription varchar(500) not null
)

CREATE TABLE #Attributes  --boolean attributes, if row exists apartment has this attribute 
(
     ApartmentID     int not null --FK to Apartments.ApartmentID    
    ,AttributeID     int not null primary key identity(1,1)
    ,AttributeType   smallint  not null --fk to AttributeTypes
)

insert sample data:

SET NO COUNT ON
INSERT INTO #Apartments VALUES ('one','A')
INSERT INTO #Apartments VALUES ('two','A')
INSERT INTO #Apartments VALUES ('three','I')
INSERT INTO #Apartments VALUES ('four','I')

INSERT INTO #AttributeTypes VALUES (1,'dishwasher')
INSERT INTO #AttributeTypes VALUES (2,'deck')
INSERT INTO #AttributeTypes VALUES (3,'pool')
INSERT INTO #AttributeTypes VALUES (4,'pets allowed')
INSERT INTO #AttributeTypes VALUES (5,'washer/dryer')
INSERT INTO #AttributeTypes VALUES (6,'Pets Alowed')
INSERT INTO #AttributeTypes VALUES (7,'No Pets')

INSERT INTO #Attributes (ApartmentID, AttributeType) VALUES (1,1)
INSERT INTO #Attributes (ApartmentID, AttributeType) VALUES (1,2)
INSERT INTO #Attributes (ApartmentID, AttributeType) VALUES (1,3)
INSERT INTO #Attributes (ApartmentID, AttributeType) VALUES (1,4)
INSERT INTO #Attributes (ApartmentID, AttributeType) VALUES (1,5)
INSERT INTO #Attributes (ApartmentID, AttributeType) VALUES (1,6)

INSERT INTO #Attributes (ApartmentID, AttributeType) VALUES (2,1)
INSERT INTO #Attributes (ApartmentID, AttributeType) VALUES (2,2)
INSERT INTO #Attributes (ApartmentID, AttributeType) VALUES (2,3)
INSERT INTO #Attributes (ApartmentID, AttributeType) VALUES (2,4)
INSERT INTO #Attributes (ApartmentID, AttributeType) VALUES (2,7)

INSERT INTO #Attributes (ApartmentID, AttributeType) VALUES (3,1)
INSERT INTO #Attributes (ApartmentID, AttributeType) VALUES (3,2)
INSERT INTO #Attributes (ApartmentID, AttributeType) VALUES (3,3)
INSERT INTO #Attributes (ApartmentID, AttributeType) VALUES (3,4)

INSERT INTO #Attributes (ApartmentID, AttributeType) VALUES (4,1)
INSERT INTO #Attributes (ApartmentID, AttributeType) VALUES (4,2)
SET NOCOUNT OFF

sample search query:

;WITH GetMatchingAttributes AS
(
SELECT
    ApartmentID,COUNT(AttributeID) AS CountOfMatches
    FROM #Attributes
    WHERE AttributeType IN (1,2,3)  --<<change dynamically or split a CSV string and join in
    GROUP BY ApartmentID
    HAVING COUNT(AttributeID)=3--<<change dynamically or split a CSV string and use COUNT(*) from resulting table
)
SELECT
    a.*
    FROM #Apartments                      a
        INNER JOIN GetMatchingAttributes m ON a.ApartmentID=m.ApartmentID
    WHERE a.Status='A'
    ORDER BY m.CountOfMatches DESC

OUTPUT:

ApartmentID ApartmentName 
----------- --------------
1           one           
2           two           

(2 row(s) affected)

In the search query above, I just included a CSV string of atribute IDs to search for. In reality, you could create a Search stored procedure where you pass in a CSV parameter containing the IDs to search on. You can look at this answer to learn about loop free splitting of that CSV strings into table which you can join to. This would result in not needing to use any dynamic SQL.

EDIT based on the many comments:

if you add a few columns to the #AttributeTypes table you could dynamically build the search page. Here are a few suggestions:

Status: "A"ctive "I"nactive
ListOrder: can use this to sort by to build the screen
ColumnNumber: can help organize fields on the same screen row
AttributeGroupID: to group fields, see below
etc.

You could make all the fields checkboxes, or add another table called #AttributesGroups, and group some together and use radio buttons. For example, since "Pets Allowed" and "No Pets" are exclusive, add a row in the #AttributesGroups table "Pets". The application would group the attributes in the interface. Attributes in Groups would work the same as regular ungrouped attributes, just collect the selected IDs and pass it in to the search procedure. However, for each group you'll need to have the application include a "no preference" radio button and default it on. This option will not have an attribute ID and is not passed in, since you don't want to consider the attribute.

In my example, I do show an example of a "super attribute" that is in the #Apartments table, "Status". You should only consider major attributes for this table. If you start using these, you may want to alter the CTE to be FROM #Apartments with filtering on these fields and then join to #Attributes. However you will run into issues of Dynamic Search Conditions, so read this article by Erland Sommarskog.

EDIT on latest comments:

here is code to have a list of exclude attributes:

;WITH GetMatchingAttributes AS
(
SELECT
    ApartmentID,COUNT(AttributeID) AS CountOfMatches
    FROM #Attributes
    WHERE AttributeType IN (1,2,3)  --<<change dynamically or split an include CSV string and join in
    GROUP BY ApartmentID
    HAVING COUNT(AttributeID)=3--<<change dynamically or split a CSV string and use COUNT(*) from resulting include table
)
, SomeRemoved AS
(
SELECT
    m.ApartmentID
    FROM GetMatchingAttributes      m
        LEFT OUTER JOIN #Attributes a ON m.ApartmentID=a.ApartmentID 
            AND a.AttributeType IN (5,6)   --<<change dynamically or split an exclude CSV string and join in
    WHERE a.ApartmentID IS NULL
)
SELECT
    a.*
    FROM #Apartments           a
        INNER JOIN SomeRemoved m ON a.ApartmentID=m.ApartmentID
    WHERE a.Status='A'

I don't think I would go this way though. I'd go with the approach I outlined in my previous EDIT above. When include/exclude of an attribute is necessary, I'd just add an attribute for each: "Pets allowed" and "No Pets".

I updated the sample data from the original post to show this.

Run the original query with:

(..,..,6,..) to find apartments that allow pets
(..,..,7,..) to find apartments where no pets are allowed
(..,..,..) if there is no preference.

I think this is the better approach. When combined with the grouping idea and dynamically built search page described in the last edit, I think this would be better and would run faster.

KM 2009-10-01 18:41:25

+1 - This is very ingenious. I hadn't thought of doing the count across matches and then finding those with all X matches. It won't work for my problem but it may for Tom. One caveat though - it seems to me that you'll have to store both Attributes and "Not-Attributes" if you want to handle a case where some attribute will rule out an apartment (e.g. someone who does NOT want a place where pets are allowed). Or you'll have to change the GetMatchingAttributes so that both IN and NOT IN clauses are included. Good work, though!

Mark Brittingham 2009-10-01 19:00:38

Also, TOM might want to consider having core attributes (those often requested) in his apartments table and then use the Attributes table just for the more obscure items to optimize performance.

Mark Brittingham 2009-10-01 19:01:49

this looks awesome, thanks for the input, I'll post more after I check it out

Tom DeMille 2009-10-01 19:46:05

this is awesome! I think the loop free splitting function from the linked in post is perhaps the most useful SQL utility I have ever seen! Thanks again!

Tom DeMille 2009-10-01 20:14:21

I just wanted to throw one thing in here about this. SQL 2008 supports table-value parameters. This means you can pass the entire search criteria in as a table, and simply do a join on that table to accomplish the same goal. Which is faster because it doesn't rely on a parsing function call to work.

Chris Lively 2009-10-01 22:19:58

@Chris Lively, table variables are a good option if on SQL Server 2008 and using a client application language that supports passing them in. However, OP never mentions which they are using.

KM 2009-10-02 13:29:12

How could I change the query to exclude attribute as well?something likeWHERE AttributeType IN (select * from dbo.FN_ListToTable(',',@i)) and AttributeType Not IN (select * from dbo.FN_ListToTable(',',@o)) above doesn't seem to work though, the result set gets larger when adding the 'not in' clause

Tom DeMille 2009-10-02 14:24:27

I have been doing some testing on the speed of this approach. As slick as this approach is, it seems to be a lot slower than just building a select statement and executing it dynamically with all the rows in one table... a lot slower actually. Using the same query and running it in a loop 200 times, the 'EXEC(@SQL)' method returned the count of listings (402 in this case) 200 times in 4 seconds, the attribute based approach returne the same 200 result count in 28 seconds, so in my limited testing the old school dynamic sql approach is 7 times faster. (indexes in place for both approaches)

Tom DeMille 2009-10-02 20:21:15

update on this, by moving some 'super' attributes from the GetMatchingAttributes query to the apartments query am able to speed things up, still slower though than dynamic sql though by about 50%

Tom DeMille 2009-10-02 20:29:03

@Tom, is that using the 1st query with only include attributes, or the second query using include+exclude attributes?

KM 2009-10-02 21:14:30

KM - just the includes, first query

Tom DeMille 2009-10-02 21:19:12

the funny thing is the query I compared it to does about 10 other joins as well, which I'm not doing in the attribute approach. I'm really suprised by the results

Tom DeMille 2009-10-02 21:19:54

run SET SHOWPLAN_ALL ON and then this query, edit the question with the output in it, and the query you are running.

KM 2009-10-02 21:32:54

ansaurus

tags:

views:

answers:

SQL design approach for searching a table with an unlimited number of bit fields

related questions