views:

53

answers:

3

I would like to know what data structure / storage strategy I should use for this problem.

Each data entry in the database consists of a list of multiple ordered items, such as A-B-C-D, where A, B, C, D are different items.

Suppose I have 3 entries in a database,

A-B-C-D

E-F-G

G-H-B-A

When the user entered some unordered items, I have to find the matching ordered entry(ies) from the database. For example, if user enters A,B,G,H, I want to return G-H-B-A from the database to the user.

What should be my data storage strategy?

Thanks.

A: 

database balanced b-tree indexes are just fine for this. what's stopping you from using them?

Mladen Prajdic
Maybe I am not very familiar with the topic. can you explain a bit more? For example, in sql table, how should I store the information (the ordered list) and how the query should be like? Thanks!
gilbertc
A: 

Split the lists into individual items and work on that level.

Some tables:

lists

  • ID (PK)
  • sequence (the "A-B-C-D" entries above)
  • [whatever else]

items

  • ID (PK)
  • name (value, word, whatever makes sense)
  • [whatever else]

list_items

  • list_ID
  • item_ID
  • [an ordinal int, if "G-H-B-A" and "A-B-G-H" are considered different sequences]

(composite PK list_ID, item_ID [, ordinal] on that one, basic many:many relation)

Some data, so it's more clear what the tables represent:

INSERT INTO items (ID, name) VALUES (1, 'A'), (2, 'B'), (3, 'G'), (4, 'H');
INSERT INTO lists (ID, sequence) VALUES (1, 'A-B-G-H');
INSERT INTO list_items (list_ID, item_ID) VALUES (1, 1), (1, 2), (1, 3), (1, 4);
INSERT INTO lists (ID, sequence) VALUES (2, 'B-A-G');
INSERT INTO list_items (list_ID, item_ID) VALUES (2, 2), (2, 1), (2, 3);

And finally, to find lists that contain all items (A, B, G, H):

SELECT lists.sequence FROM lists
JOIN list_items ON lists.ID = list_items.list_ID
JOIN items AS i1 ON list_items.item_ID = i1.ID HAVING i1.name = 'A'
JOIN items AS i2 ON list_items.item_ID = i2.ID HAVING i2.name = 'B'
JOIN items AS i3 ON list_items.item_ID = i3.ID HAVING i3.name = 'G'
JOIN items AS i4 ON list_items.item_ID = i4.ID HAVING i4.name = 'H'

That should return any lists like "A-B-G-H", "G-H-A-B", "H-A-T-B-A-G", etc, but not "B-U-G-H-U-T" (no A) or "B-A-T-H" (no G) - all conditions have to be satisfied. Doing an "any" search might be a little more involved (writing this in my head over lunch, but RIGHT JOIN alone would probably result in all kinds of duplicates & slowness).

It won't map any genomes or redefine human language, but should be okay for a decent-sized data set. Either way, I'd avoid storing each list as a varchar and doing "WHERE sequence LIKE '%A%' AND sequence LIKE '%B%'" stuff unless you absolutely can't handle the extra work to add new data.

tadamson
I may have a large number of items in the list. The JOINs would be too expensive.
gilbertc
+1  A: 

You're best off storing the ordered and unordered elements separately, otherwise you'll need to search on all permutations of the ordered elements, which would be time consuming.

Try this:

/* Create a table to track your items (A, B, C, etc.). It contains all possible elements */
CREATE TABLE [Items](
    [Value] [char](1) NOT NULL,
 CONSTRAINT [PK_Items] PRIMARY KEY CLUSTERED ([Value]))

/* Create a table to track their grouping and stated ordering */
CREATE TABLE [Groups](
    [ID] [int] NOT NULL,
    [Order] [text] NOT NULL,
 CONSTRAINT [PK_Groups] PRIMARY KEY CLUSTERED ([ID]))

/* Create a mapping table to associate them */
CREATE TABLE [ItemsToGroups](
    [Item] [char](1) NOT NULL,
    [Group] [int] NOT NULL
)

ALTER TABLE [ItemsToGroups]  WITH CHECK ADD CONSTRAINT [FK_ItemsToGroups_Groups] FOREIGN KEY([Group])
REFERENCES [Groups] ([ID])

ALTER TABLE [ItemsToGroups] CHECK CONSTRAINT [FK_ItemsToGroups_Groups]

ALTER TABLE [ItemsToGroups]  WITH CHECK ADD CONSTRAINT [FK_ItemsToGroups_Items] FOREIGN KEY([Item])
REFERENCES [Items] ([Value])

ALTER TABLE [ItemsToGroups] CHECK CONSTRAINT [FK_ItemsToGroups_Items]

/* Populate your tables. 
   Items should have eight rows: A, B, C,...H
   Groups should have three rows: 1:ABCD, 2:EFG, 3:GHBA
   Items to groups should have eleven rows: A:1, B:1,...A:3 */

/* You will want to pass in a table of values, so set up a table-valued parameter
   First, create a type to support your input list */
CREATE TYPE ItemList AS TABLE (e char(1) NOT NULL PRIMARY KEY)
DECLARE @Input ItemList
GO

/* Create a stored procedure for your query */
CREATE PROCEDURE SelectOrderedGroup @Input ItemList READONLY AS
    SELECT *
    FROM Groups
    WHERE Groups.ID NOT IN (
        SELECT [Group]
        FROM ItemsToGroups
        WHERE Item NOT IN (SELECT e FROM @Input)
    )
GO

/* Now when you want to query them: */
DECLARE @MyList ItemList
INSERT @MyList(e) VALUES('G'),('H'),('B'),('A')
EXEC SelectOrderedGroup @MyList

The above will return 3:GHBA, like you want. If you pass in DCBA you'll get back 1:ABCD, again like you're looking for. If you pass in C, you'll get back nothing, as no group consists of just C.

You will probably want to use a table-valued parameter for your input, as shown above, but you could convert the final SELECT to a simple list and drop the ItemList type.

fatcat1111