views: 79
answers: 7

I have a structure of about 10 tables. This structure works great for data entry. However, I also need to perform complex and fast searches on that dataset. There are three approaches to this that I can think of:

  1. Join all those tables in a select. This is pretty slow and likely not a good approach. If it's relevant, the database is Informix. I've looked into creating views, hoping they'd be better optimized, but preliminary testing shows that selects on the view are even slower than the raw joins; maybe there is some way to make Informix pre-join the tables and index the result, but from what I've seen that's unlikely (though I may be missing some Informix options). Both the joins and the view are slower than approach #2:

  2. A single synthetic table which is updated periodically (a rough sketch follows this list). This seems like the right approach, especially since searches don't need real-time data - I can probably get away with updating the synthetic table daily. Data size would be about 500k-1000k rows.

  3. Memcached and similar in-memory solutions. At the moment there is no such infrastructure in place, and this probably doesn't warrant implementing it; however, it's something I'll look at once the synthetic table becomes too slow. Also, there are lots of search parameters and even the first query has to be fast, so this approach would have to eagerly cache all the data. Of course, I'll probably cache whatever I can with approaches 1 and 2 as well.
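
To illustrate #2, here's roughly what I have in mind - table and column names below are just placeholders, not my real schema:

-- Rough sketch of #2: rebuild a denormalized search table nightly.
DELETE FROM package_search;

INSERT INTO package_search
    (package_id, hotel_id, stars, date_from, date_to, rooms_left, seats_left, price)
SELECT p.package_id, h.hotel_id, h.stars, p.date_from, p.date_to,
       r.rooms_total - r.rooms_sold,
       f.seats_total - f.seats_sold,
       pr.catalog_price
  FROM package p
  JOIN hotel h  ON h.hotel_id = p.hotel_id
  JOIN flight f ON f.flight_id = p.flight_id
  JOIN room r   ON r.hotel_id = h.hotel_id
  JOIN price pr ON pr.package_id = p.package_id;
-- (remaining joins omitted; searches then hit package_search directly)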

I'd like your thoughts on this. Is there a magic bullet I'm missing? What have you used in similar situations?

+1  A: 

You're right on track.

There's no magic bullet for this, because your tables are really spread out. What I've done in the past is do something like your Option 2.

Let's say I have a table of Accounts with an AccountID as the PK. I'd create another table called AccountSearch which would be related in a many-to-one relationship with Accounts. AccountSearch would contain a set of strings and their associated IDs.

If you wanted fuzzier searching, you could also manipulate the strings using NYSIIS or Soundex (yuck), or simply remove whitespace. You could also implement full-text searching, but that is often overkill. A sketch of how the search table might be used follows the schema below.

Account
-------
AccountID (PK)
Name
OwnerName

AccountSearch
-------------
SearchString (PK)
AccountID (PK)
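
For example, a sketch of how the search table might be populated and used (exact types and how you normalize the strings are up to you):

-- Populate the search table with normalized strings from each searchable column.
INSERT INTO AccountSearch (SearchString, AccountID)
SELECT UPPER(Name), AccountID FROM Account
UNION
SELECT UPPER(OwnerName), AccountID FROM Account;

-- A search is then one indexed lookup plus a key join back to Account.
SELECT a.*
  FROM Account a
  JOIN AccountSearch s ON s.AccountID = a.AccountID
 WHERE s.SearchString = UPPER('dave markle');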
Dave Markle
Thanks. In my case, search parameters are pretty well structured and none of parameters are text fields so I won't benefit much on precalculated or processed fields.
Domchi
+1  A: 

Option 2 is called a data mart or a data warehouse. For the cost of additional storage, you can have an operational database and a query database.

Since you say there are lots of search parameters, you can create your query tables using a star schema, based on the search parameters.
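
A minimal sketch of what that might look like (table and column names are purely illustrative):

-- One fact table keyed by the dimensions you search on.
CREATE TABLE dim_hotel   (hotel_id   INTEGER PRIMARY KEY, name VARCHAR(80), stars SMALLINT);
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, cal_date DATE);
CREATE TABLE dim_catalog (catalog_id INTEGER PRIMARY KEY, name VARCHAR(80));

CREATE TABLE fact_availability (
    hotel_id   INTEGER REFERENCES dim_hotel (hotel_id),
    date_id    INTEGER REFERENCES dim_date (date_id),
    catalog_id INTEGER REFERENCES dim_catalog (catalog_id),
    rooms_free INTEGER,
    seats_free INTEGER,
    price      DECIMAL(10,2)
);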

Gilbert Le Blanc
I kind of agree with Stephanie; my case is not as complex as a typical data warehouse. But I agree with you that I could probably learn a trick or two from you guys who have more experience with them, so thanks for the search terms. :)
Domchi
+1  A: 

Option 1.

Depending on the volume of data in your tables, a join across 10 tables should run in a reasonable amount of time. How slow is too slow for you?

Here are the two biggest things you can do to make sure your queries run smoothly.

First, make sure your logical table design is really logical. Bad table design and bad column design are responsible for a large share of the unnecessary slowdowns in database apps. The fact that data entry is working well is a pretty strong indication that your table design is pretty good. Is your design normalized, or only somewhat normalized?

Second, create the right indexes. The right indexes can make a query run a hundred times faster, depending on the circumstances. In order to build the right indexes you need to know a little about how indexes work, about the queries you are running, about the volume of data, and about the strategy the DBMS chooses when executing the query.
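
For instance (made-up names; the point is to index the columns your WHERE clauses and join conditions actually use):

-- A composite index matching a common search predicate.
CREATE INDEX ix_package_dates ON package (date_from, date_to, hotel_id);

-- Index the foreign-key columns used by the multi-table join.
CREATE INDEX ix_room_hotel ON room (hotel_id);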

Option 2.

This may well be your best bet. Learn a little about data marts and data warehouses. That's how database people deal with design issues involving one schema for data entry, a different schema for queries, and a process to keep the two schemas in sync.

There are a number of design issues here, and rather than try to enumerate them, I'm just going to suggest that you bone up on data marts.

Walter Mitty
logical table design usually isn't done in a database, that's where you move FROM logical TO physical.
Stephanie Page
Steph, I'm of the old school, where conceptual modelling uses the ER model, logical modelling uses the relational model, and physical modelling uses the SQL model and DBMS specific features. Issues like normalization can be dealt with in a manner that's independent of which DBMS you are using, and whether or not your interface is SQL. I call these logical issues.
Walter Mitty
Ah, YES! thank you for that. I've been struggling with a solid, simple definition of Logical model. I've said that the only difference between a logical and conceptual is a logical migrates primary identifiers to children and converts many-many relations to a pair of one-many with a mapping table. Your explanation is more elegant. What other things do you do in a LM? Ok, this should be a new question...
Stephanie Page
Walter, those 10 tables are normalized, but the query looks at the data in a bit different way, which is why I might benefit from one denormalized table. For example, on entry there is one package: 80 seats on a plane and 2 types of rooms, one type having 20 and the other having 30 available rooms. But on search, if the charter is sold out (since another package cross-sells the same charter flight), even if rooms are available, the package is officially sold out. If I opt for #2, I might be able to do some of the sorting/filtering when generating the synthetic table - at least I hope so.
Domchi
Steph, I want to respond to your question, but not here. The short answer is that my practice is less elegant than my description of it.
Walter Mitty
OP, if you denormalize, follow a plan. Learn star schema to start with. A well designed denormalized schema will serve you much better than a haphazard one.
Walter Mitty
+2  A: 

In-memory databases take millisecond database access times and turn them into microsecond access times. Is this an automated trading system, a 911 dispatch system, or an aviation traffic control system? If not, you'd be hard pressed to show a requirement for microsecond access times.

Walter has it right when he asks "how slow is too slow?" Define your requirements clearly: is this an internal or external SLA? Do you have actual requirements, or does this just 'feel' too slow?

Learn to read an execution plan and examine the plan for your slow query. Is there a cardinality estimate that is way off? Does it anticipate 1 row when you know there are 100k rows? Is it doing a full table scan on a table you expect 1 row from?
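
Since you're on Informix, SET EXPLAIN will show you the plan; the output goes to a file called sqexplain.out (the query below is only a stand-in for your real one):

-- Informix: capture the optimizer's plan for the slow query.
SET EXPLAIN ON;

SELECT p.package_id, h.name        -- stand-in for the real 10-table search query
  FROM package p
  JOIN hotel h ON h.hotel_id = p.hotel_id
 WHERE h.stars >= 3;

SET EXPLAIN OFF;
-- In sqexplain.out, compare the estimated row counts at each step
-- with what you know is really in the tables.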

If the query looks as efficient as it can be, trace it... see if you can identify any time sinks you're not expecting. Is it fine when run solo, but performs poorly under load? Honestly, 10 tables with not a lot of data to begin with really shouldn't be super slow.

I think Gilbert is overestimating your problem. With no more than 1M records, a full dimensional model seems like overkill. From the tone of your question it sounds like you're really just trying to speed up a query or three - not laying the foundation of an entire BI platform. If that's the case, look back at the explain plan, see if you can identify major amounts of work that could be reduced by pre-calculating some joins (denormalization), and build that new materialized view... try the query; if there's no improvement, drop it and try something else... do not keep building on unsuccessful attempts.

Now I see the travel industry comment

So you have 2 classes of room, 30 doubles and 20 singles, and you have 80 seats on the plane. But the doubles can add an extra bed so you might run out of seats before you run out of rooms.

Rooms Remaining
---------------
5 Single Remain
10 Doubles Remain

Seats Remaining
---------------
8 Plane seats

Since there's one plane and 2 room types, you'll just Cartesian those together.

Package Type       Rooms      Seats      Packages Available
------------       ------     -----      ------------------
 Single              5           8             5
 Double              10          8             8

Notice that Packages Available is a simple calculation: LOWEST(Rooms, Seats).
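
In SQL that could be written something like this (a sketch; the table names are invented, and if your Informix version has no LEAST/LOWEST function, a CASE expression does the same job):

-- Cartesian join of the two small availability sets: one plane, two room types.
SELECT r.package_type,
       r.rooms_remaining,
       s.seats_remaining,
       CASE WHEN r.rooms_remaining < s.seats_remaining
            THEN r.rooms_remaining
            ELSE s.seats_remaining
       END AS packages_available
  FROM rooms_remaining r, seats_remaining s;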

In your comment you said

even if rooms are available, the package is officially sold out.

Package Type       Rooms      Seats      Packages Available
------------       ------     -----      ------------------
 Single              5           0             0
 Double              0           0             0

So here's that case... you've filled the double rooms and 5 of them are triples... so the plane is full and there are 5 extra single rooms. But our LOWEST calculation does the work to show there are no Single Packages available.

am I close?

Stephanie Page
I'll answer here - everything under 10 seconds is fast, and everything up to 25 seconds is tolerable. I'm still in the design phase, and while I did some tests, the system is complex enough that I probably don't have time to build both approaches. I'm worried that #1 will break under load if I go with it. Some rough tests suggested that while the #2 query runs under a second, #1 can take 15-23 seconds, and that's with only one concurrent user. Sure, I could optimize it, and it might end up fast enough, but it's still 15 to 23 times slower and I feel uneasy betting on that approach.
Domchi
One more note - I have one query, and that query would have to join all 10 tables at once.
Domchi
You seem infatuated with the number of tables in the join... that's not a huge issue. The important aspect is the size of each intermediate result set: how many rows, whether from a table or an index, have to be read in and plowed through before you can move on to the next step.
Stephanie Page
Domchi, Ah, I just saw the post you made to Walter... this is a travel industry DB. I used to work at America West Airlines and at IcelandAir so I'm doubly interested in your issue... don't worry I now work in the Utilities field so I won't steal your IP.
Stephanie Page
Stephanie, I'm not worried; if only IP in the travel industry were as simple as this... You're on track with your edit, but the situation is a bit more complicated. Sometimes several packages (products) sell the same room or the same flight, and sometimes when a room is sold it's not even known which hotel the guest will stay in, only that it will be a single with an extra bed in one of three possible hotels. My SQL formulas are basically what you've said, but a lot more complicated. :) But that's not relevant to my original issue so...
Domchi
A: 

In the past I have used an implementation similar to #2. You could try creating a view which would basically consist of the searchable fields from each table, e.g.

SELECT Name FROM Person
UNION SELECT Name FROM Company

Then feed that view into a full-text indexing product such as Sphinx, which can optimize your searching and provide flexible options for weights, terms, etc., as well as scheduling how often your indexes are updated.

Rob
I'm searching structured data, not text. It's basically searching a database of airline flights (with seats), accommodations and accommodation packages with room quantities, and with several prices for each item (catalog price, first minute, last minute and so on). The searches are pretty specific, like: give me free rooms and packages for the period from 2010-08-01 to 2010-08-08, for the Mediterranean catalog and hotels with 3+ stars, for two adults and one 8-year-old child.
Domchi
+1  A: 

How often do you need search criteria on all the tables?

One way of working that may improve performance is to make sure the main query adapts to the search criteria, only joining the necessary tables, and retrieving just the primary key values from the main table. This data might be saved into a temporary table, or in a scroll cursor, or fetched back to the client.

Then, when you need to collect the information for display, you use a (prepared) SELECT that collects exactly the data you need for the row(s) you need.

The advantage of this is that (for many queries) you seldom specify conditions on all 10 tables, so you don't need to do the 10-way join while discovering the relevant records. And the single-row operation is all joins on keys so the lookups are indexed with no scanning.

Clearly, you can juggle the criteria; you might select all the data from the primary table, and all the relevant values from one of the secondary tables (which is always needed) but decide not to select any values from the other 8 tables (because they don't always appear in the search criteria), or other variants along these lines.

This assumes you can build dynamic SQL, but that is very seldom an issue.
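
A rough sketch of the two phases (names are invented; the first statement would be assembled dynamically from whichever criteria were actually supplied):

-- Phase 1: collect matching keys, joining only the tables the criteria touch.
SELECT p.package_id
  FROM package p, hotel h
 WHERE h.hotel_id = p.hotel_id        -- hotel joined only because stars was specified
   AND h.stars >= 3
   AND p.date_from >= MDY(8,1,2010)
  INTO TEMP matching_packages;

-- Phase 2: fetch display data for just those keys; every join is on an indexed key.
SELECT p.*, h.name, f.flight_no, pr.catalog_price
  FROM matching_packages m, package p, hotel h, flight f, price pr
 WHERE p.package_id  = m.package_id
   AND h.hotel_id    = p.hotel_id
   AND f.flight_id   = p.flight_id
   AND pr.package_id = p.package_id;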

Jonathan Leffler
A: 

Consolidate your 10 tables into one temporary table. See: http://stackoverflow.com/questions/3057990/transfusion-should-i-denormalize-loan-purchase-inventory-and-sale-tables-into

Frank Computer
That's my #2. 10 tables, and data is periodically loaded from them to 1 synthetic table.
Domchi
Similar to #2, except that my consolidated transaction table is permanent, not temp. OK, so selecting the needed columns from each table into a temp table to support your queries should be the best method. Although the temp table will be large, queries will be simple [where col1 = value and col2 = value ...] because joins will be minimized or not used.
Frank Computer