tags:

views:

159

answers:

2

I'm looking to build a powerful search feature for a site, similar to NewEgg's drilldown search, e.g.,

http://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=2010150014%201035507776&name=7200%20RPM

I'm working with a variety of objects similar to products which have different criteria. Can anyone recommend a good design for building a search engine like NewEgg's?

+2  A: 

Storing the data "vertically", i.e. in an Entity-Attribute-Value (EAV) format, along with the [meta]data-driven schema management implicit to EAV, provide a framework where each product's attributes are "independent" from from one another. This, in turn, facilitates the implementation of drill-down (i.e. guided refinement of the query, where at each step the end-user is supplied with the list of possible attributes still applicable, for each such attribute the list of possible values).

A small caveat is that this is better applicable to smaller catalogs (say fewer than 1 Million products), for the EAV model can introduce some performance and/or scaling issues with bigger databases. The actual size at which performance is a concern varies with the specifics of the catalog (average number of attributes per product, commonality of attributes between products of a different type, general complexity of the "ontology" etc.), but EAV is quite the way to go for smaller catalogs. In addition to its support for the "drill down" filtering describes it introduces flexible data schema (ability to add/remove attributes and/or product types etc., without requiring a change of the physical (database) schema; only the logical schema is altered).

Edit: more detail/resources on EAV
Admittedly, the Wikipedia article about it is somewhat abstract...
In a nutshell, the model identify the following concepts:

  • Entity (aka a product, or an item) = a "record" in the traditional relational lingo
  • Attribute = a "column" (aka a "field") in RDBMS lingo
  • Value = the numeric (or string or otherwise) value of a given column for a given record.
  • Type (aka category) = [loosely] a "table" in RDBMS, i.e. a set of records which share, generally, the same set of attributes.

To illustrate this with, say, an electronics goods catalog, an entity could be a particular "Flat Screen Monitor", its Type could be "Display Devices", its Attributes "Size", "Resolution", "Price" etc.

With EAV, the bulk of the information is stored in two tables called say the Product table, and the ProductAttributes table:

Product table  
   "ProductID" (primary key, the "EntityId")
    "TypeId" 
    optionally, some common attributes found in all/most other Products, say...
      price
      ManufacturerId
      Photo

ProductAttributes table
    "ProductID" (Foreign Key to Product table)
    "AttributeID"  (FK to Attribute table)
    "Value"   (actual value; note: sometimes we can have several SQL fields for
               this say IntValue, StringValue, DateValue, allowing to store 
               values in their natural format)

The tables above constitute the bulk of the data, and it is complemented by tables storing the [logical] schema of the catalog, also known as the "metadata". These tables include:

  • Attributes table where the attributes are defined: Name, datatype, isRequired and such.
  • Types Table where the types (categories) are defined: Name, possibly parent type in the case of hierarchical ontologies.
  • Type_Attributes where the possible attributes for a given type are listed (eg: a "TV Set" has attribute "Number of channels", "Screen Size" etc. and a "VCR" has attribute "Number of Heads", "Format supported", "Body color" etc.

All this may seem somewhat complicated, compared with the traditional approach whereby the logical schema is "hardcoded" within the SQL schema, i.e. we have one "TVSets" Table with its set of columns one per attribute, and then a "VCR" table with its own, different set of columns/attributes. However with such an approach the application logic ends up hard-coding in some fashion (if only through an indirection in a map of sorts) the table and column names.
In contrast, the EAV model, allows the program to discover the list of possible types, and, for each type the list of possible attributes (either required or optional). Also, since the attribute values are all stored in the same table, it is possible to filter on attributes irrespective of the type (or sub-type) of the product. For example to get all items cheaper than 50 dollars (in the other approach we may have had to look in dozen of tables for that).

Back to the "drill-down" feature...
Once an initial search is performed (say searching all products where name [full-text indexed] contains the word "screen"), the ProductAttributes table can produce the distinct list of all different AttributeID (hence attribute name by lookup in Attributes table) for product satisfying this first search criteria.
Upon the user selecting a given attribute (say "Manufacturer", the ProductAttributes table can produce the distinct list of manufacturers (along with the number of products for each manufacturer). (alternatively such info can be searched initially rather than lazily, when the the user requests it).
The user then selects a given Manufacturer (or several), and a new query is ran to reduce the initial results list. The list of possible Attributes (and within each attributes the list of possible values) decreases, since some products (entities) initially selected are now excluded.
The process continues, providing the end user with guided search into the catalog. Of course the user may backtrack etc.

To maybe help with this wordy explanation (or maybe to further confuse the reader...) the following snippet provides a more precise indication of way this structure can be used to implement searches. This code is adapted for the table names used in the explanation above and may include a few typos, but generally provide the flavor of things. Also, this is written with a Common Table Expression (CTE) but could well be written as a subquery. Also not the that we do not join with the logical schema (meta data) tables, but that could be done too, to get the attribute names, type name and such, directly in the resultset.
As hinted earlier the queries and logic supporting this architecture are more complicated but also more versatile and tolerant of changes in the type of items stored and their attributes. Of course, this type of query is generated dynamically, based on the current list of search criteria supplied by the end-user.

WITH SearchQry AS (
  SELECT ROW_NUMBER() OVER (ORDER BY P.EntityId ASC) AS RowNum,  
         P.EntityId AS EId
         FROM  Products P
         INNER JOIN ProductAttributes PA1 ON P.EntitityId = PA1.EntityId and PA1.AttributeID = <some attribute id, say for Manufacturer> 
         INNER JOIN ProductAttributes PA2 ON P.EntitityId = PA2.EntityId and PA2.AttributeID = <some other attribute id, say for Color>
         -- here for additional PAn JOINs as more criteria is added
         WHERE  P.ProductType IN (ProdId_x, ProdId_y, ProdId_z)  -- for example where these x,y,z Ids correspond to say "TV Sets", "LapTop Computers" and "PDAs" respectively
            AND PA1.Value = 'SAMSUNG' -- for example
            AND PA2.Value = 'YELLOW' -- for example
         GROUP BY P.EntityId
   )  

SELECT  P.EntityId, PA.AttributeId, PA.Value -- PA.IntValue (if so structured)
FROM (SELECT * FROM SearchQry WHERE RowNum BETWEEN  1 AND  15)  AS S
JOIN ProductAttributes PA ON PA.EntityId = S.EId
INNER JOIN Products P on P.EntityID = PA.EntityId
ORDER BY P.EntityId, P.AttributeId  -- or some other sort order

Sorry for the long explanation, there's maybe [probably] a better description of this online, but I haven't found it...

mjv
Can you link me to any resources on EAV?
Kirk
Love it, thanks for the detailed description of the system! This is extremely helpful for my project.
Kirk
@Kirk: glad to help. See added snippet providing a "flavor" for this approach. I wish I could find some online reference with this type of practical explanation of the EAV model, applied to store catalogs; I'm probably not looking in the right places ;-)
mjv
A: 

Instead of using a relational database, use a standalone full-text search engine for drilldowns or faceted search, "Solr" for example. Sooner or later "grouping" will be a performance issue when you querying the database.

Worth to take a look

Solr Search Server

dahuda