Storing the data "vertically", i.e. in an Entity-Attribute-Value (EAV) format, along with the [meta]data-driven schema management implicit to EAV, provide a framework where each product's attributes are "independent" from from one another. This, in turn, facilitates the implementation of drill-down (i.e. guided refinement of the query, where at each step the end-user is supplied with the list of possible attributes still applicable, for each such attribute the list of possible values).
A small caveat is that this is better applicable to smaller catalogs (say fewer than 1 Million products), for the EAV model can introduce some performance and/or scaling issues with bigger databases. The actual size at which performance is a concern varies with the specifics of the catalog (average number of attributes per product, commonality of attributes between products of a different type, general complexity of the "ontology" etc.), but EAV is quite the way to go for smaller catalogs. In addition to its support for the "drill down" filtering describes it introduces flexible data schema (ability to add/remove attributes and/or product types etc., without requiring a change of the physical (database) schema; only the logical schema is altered).
Edit: more detail/resources on EAV
Admittedly, the Wikipedia article about it is somewhat abstract...
In a nutshell, the model identify the following concepts:
- Entity (aka a product, or an item) = a "record" in the traditional relational lingo
- Attribute = a "column" (aka a "field") in RDBMS lingo
- Value = the numeric (or string or otherwise) value of a given column for a given record.
- Type (aka category) = [loosely] a "table" in RDBMS, i.e. a set of records which share, generally, the same set of attributes.
To illustrate this with, say, an electronics goods catalog, an entity could be a particular "Flat Screen Monitor", its Type could be "Display Devices", its Attributes "Size", "Resolution", "Price" etc.
With EAV, the bulk of the information is stored in two tables called say the Product table, and the ProductAttributes table:
Product table
"ProductID" (primary key, the "EntityId")
"TypeId"
optionally, some common attributes found in all/most other Products, say...
price
ManufacturerId
Photo
ProductAttributes table
"ProductID" (Foreign Key to Product table)
"AttributeID" (FK to Attribute table)
"Value" (actual value; note: sometimes we can have several SQL fields for
this say IntValue, StringValue, DateValue, allowing to store
values in their natural format)
The tables above constitute the bulk of the data, and it is complemented by tables storing the [logical] schema of the catalog, also known as the "metadata". These tables include:
- Attributes table where the attributes are defined: Name, datatype, isRequired and such.
- Types Table where the types (categories) are defined: Name, possibly parent type in the case of hierarchical ontologies.
- Type_Attributes where the possible attributes for a given type are listed (eg: a "TV Set" has attribute "Number of channels", "Screen Size" etc. and a "VCR" has attribute "Number of Heads", "Format supported", "Body color" etc.
All this may seem somewhat complicated, compared with the traditional approach whereby the logical schema is "hardcoded" within the SQL schema, i.e. we have one "TVSets" Table with its set of columns one per attribute, and then a "VCR" table with its own, different set of columns/attributes. However with such an approach the application logic ends up hard-coding in some fashion (if only through an indirection in a map of sorts) the table and column names.
In contrast, the EAV model, allows the program to discover the list of possible types, and, for each type the list of possible attributes (either required or optional). Also, since the attribute values are all stored in the same table, it is possible to filter on attributes irrespective of the type (or sub-type) of the product. For example to get all items cheaper than 50 dollars (in the other approach we may have had to look in dozen of tables for that).
Back to the "drill-down" feature...
Once an initial search is performed (say searching all products where name [full-text indexed] contains the word "screen"), the ProductAttributes table can produce the distinct list of all different AttributeID (hence attribute name by lookup in Attributes table) for product satisfying this first search criteria.
Upon the user selecting a given attribute (say "Manufacturer", the ProductAttributes table can produce the distinct list of manufacturers (along with the number of products for each manufacturer). (alternatively such info can be searched initially rather than lazily, when the the user requests it).
The user then selects a given Manufacturer (or several), and a new query is ran to reduce the initial results list. The list of possible Attributes (and within each attributes the list of possible values) decreases, since some products (entities) initially selected are now excluded.
The process continues, providing the end user with guided search into the catalog. Of course the user may backtrack etc.
To maybe help with this wordy explanation (or maybe to further confuse the reader...) the following snippet provides a more precise indication of way this structure can be used to implement searches. This code is adapted for the table names used in the explanation above and may include a few typos, but generally provide the flavor of things. Also, this is written with a Common Table Expression (CTE) but could well be written as a subquery. Also not the that we do not join with the logical schema (meta data) tables, but that could be done too, to get the attribute names, type name and such, directly in the resultset.
As hinted earlier the queries and logic supporting this architecture are more complicated but also more versatile and tolerant of changes in the type of items stored and their attributes. Of course, this type of query is generated dynamically, based on the current list of search criteria supplied by the end-user.
WITH SearchQry AS (
SELECT ROW_NUMBER() OVER (ORDER BY P.EntityId ASC) AS RowNum,
P.EntityId AS EId
FROM Products P
INNER JOIN ProductAttributes PA1 ON P.EntitityId = PA1.EntityId and PA1.AttributeID = <some attribute id, say for Manufacturer>
INNER JOIN ProductAttributes PA2 ON P.EntitityId = PA2.EntityId and PA2.AttributeID = <some other attribute id, say for Color>
-- here for additional PAn JOINs as more criteria is added
WHERE P.ProductType IN (ProdId_x, ProdId_y, ProdId_z) -- for example where these x,y,z Ids correspond to say "TV Sets", "LapTop Computers" and "PDAs" respectively
AND PA1.Value = 'SAMSUNG' -- for example
AND PA2.Value = 'YELLOW' -- for example
GROUP BY P.EntityId
)
SELECT P.EntityId, PA.AttributeId, PA.Value -- PA.IntValue (if so structured)
FROM (SELECT * FROM SearchQry WHERE RowNum BETWEEN 1 AND 15) AS S
JOIN ProductAttributes PA ON PA.EntityId = S.EId
INNER JOIN Products P on P.EntityID = PA.EntityId
ORDER BY P.EntityId, P.AttributeId -- or some other sort order
Sorry for the long explanation, there's maybe [probably] a better description of this online, but I haven't found it...