tags:

views:

123

answers:

6

Dear Everyone, I have a table,

| PAGELETS | CREATE TABLE `PAGELETS` (
  `page_key` int(32) unsigned NOT NULL,
  `pagelet_serial` int(32) unsigned NOT NULL,
  `pagelet_shingle` int(32) unsigned NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8

I would like to:

1) Find all the pagelet_shingles where quantity > 1 ( occurs more than once)
2) out of these only output those that have different page_key

This is the query that produces the a semi-correct answer:

SELECT * FROM PAGELETS WHERE pagelet_shingle IN( SELECT pagelet_shingle FROM PAGELETS GROUP BY pagelet_shingle HAVING COUNT(DISTINCT page_key) > 1) ORDER BY pagelet_shingle;

Unfortunately, on a small dataset it takes about 18 seconds;

I have another query,

SELECT dt1.* FROM 
(SELECT * FROM PAGELETS 
GROUP BY page_key, pagelet_shingle HAVING COUNT(*) = 1) 
dt1 JOIN 
(SELECT * FROM PAGELETS GROUP BY pagelet_shingle HAVING COUNT(*) > 1) 
dt2 USING (pagelet_shingle) ORDER BY pagelet_shingle

given by an expert which is not technically correct (something to do with you can't SELECT * .. GROUP ) but produces results that are A LOT faster, with the case where

SELECT * FROM PAGELETS WHERE pagelet_shingle=57

+----------+----------------+-----------------+
| page_key | pagelet_serial | pagelet_shingle |
+----------+----------------+-----------------+
|        1 |             99 |              57 | 
|        1 |             99 |              57 | 
|        2 |            228 |              57 | 
|        2 |            228 |              57 | 
+----------+----------------+-----------------+

The semi-correct query produces

+----------+----------------+-----------------+
| page_key | pagelet_serial | pagelet_shingle |
+----------+----------------+-----------------+
|        1 |             99 |              57 | 
|        1 |             99 |              57 | 
|        2 |            228 |              57 | 
|        2 |            228 |              57 | 
+----------+----------------+-----------------+

While the incorrect query doesn't have pagelet_shingle =57 in its resultset

My desired result is to have

+----------+----------------+-----------------+
| page_key | pagelet_serial | pagelet_shingle |
+----------+----------------+-----------------+
|        1 |             99 |              57 |  
|        2 |            228 |              57 | 
+----------+----------------+-----------------+

Each occuring once only.

a pagelet_shingle occuring twice in the same pagelet_serial will be omitted.

So I would like to ask the following: 1) Is there a way to to speed up the csemi orrect query to reach the speed of the incorrect one 2) or is there a way to fix the incorrect one to produce the result of the correct one ( I don't care about strictness )

A: 

Sounds like SELECT DISTINCT p.* ... would be your choice.

P.S. And I would really recommend the second one! make everything slow (like you just noticed) and should only be used where necessary.

Marcel J.
Actually why would you recommend the second one if it's slow? I don't want to use the second one because this will be applied to a dataset for over 20 times its current size.SELECT DISTINCT * FROM PAGELETS WHERE pagelet_shingle IN( SELECT pagelet_shingle FROM PAGELETS GROUP BY pagelet_shingle HAVING COUNT(DISTINCT page_key) > 1) ORDER BY pagelet_shingle;Solves it, but any way to speed this up using index? ( I don't know which I should index for this matter, I tried indexing key(page_shingle, page_key) but it was equally slow
Whoops, was a little bit to early for me. Of course I meant the first one.
Marcel J.
A: 

doesn't this query solve your issue?

SELECT dt1.* FROM 
(SELECT DISTINCT * FROM PAGELETS 
GROUP BY page_key, pagelet_shingle HAVING COUNT(*) = 1) 
dt1 JOIN 
(SELECT * FROM PAGELETS GROUP BY pagelet_shingle HAVING COUNT(*) > 1) 
dt2 USING (pagelet_shingle) GROUP BY pagelet_shingle
pixeline
Nope - not in MySQL (Ref: ENGINE=MyISAM)
goddva
(1,64,8)(1,64,9)(1,64,10)(1,64,11)(1,64,12)(1,64,13)(1,64,14)(1,64,15)(1,64,16)(1,41,20)(1,41,21)(1,41,22)(1,99,48)(1,99,49)(1,99,50)(1,99,51)(1,99,52)(1,99,53)(1,99,54)(1,99,58)(1,99,59)(1,99,60)(1,99,61)Actually not really I can't see anything that has different page_key values; the one where page_key = 57 problem still exist( it's not found inside the result set becuase it occurs more than once)
A: 

What is

SELECT * FROM PAGELETS GROUP BY pagelet_serial, pagelet_shingle HAVING COUNT(*) > 0

giving you?

goddva
| page_key | pagelet_serial | pagelet_shingle |+----------+----------------+-----------------+| 1 | 56 | 1 | | 1 | 56 | 2 | | 1 | 56 | 3 | | 2 | 186 | 8 | | 1 | 64 | 8 | | 1 | 64 | 9 | | 2 | 186 | 9 | | 1 | 64 | 10 | | 2 | 186 | 10 |
Not what I really want:(1,56,1)(1,56,2)(1,56,3)(2,186,8)(1,64,8)(1,64,9)(2,186,9)(1,64,10)(2,186,10)(1,64,11)(2,186,11)(1,64,12)(2,186,12)(1,64,13)(2,186,13)(1,64,14)(2,186,14)(1,64,15)(2,186,15)(1,64,16)(2,186,16)(1,41,20)(2,203,20)(1,41,21)(2,203,21)(2,203,22)(1,41,22)(1,21,27)(1,21,28)(1,21,29)(1,21,30)(1,21,31)(1,21,32)(1,21,33)(1,21,34)(1,21,35)(1,21,36)(1,21,37)(1,21,38)(1,21,39)(1,21,40)(1,21,41)(1,21,42)(1,21,43)(1,21,44)(2,228,48)(1,99,48)(2,228,49)(1,99,49)(2,228,50)(1,99,50)(2,228,51)(1,99,51)(2,228,52)(1,99,52)
A: 

use GROUP BY and HAVING, e.g.

  SELECT *
    FROM `pagelets`
GROUP BY `pagelet_shingle`
  HAVING COUNT(*) > 1

additionally you can do a self join to output all columns, though in mysql it should work that way (different from SQL standard)

knittl
A: 

Judging from what I read, what you are looking for is:

SELECT DISTINCT p1.page_key, p1.pagelet_serial, p1.pagelet_shingle
  FROM PAGELETS p1
  JOIN PAGELETS p2 ON p2.page_key         = p1.page_key
                  AND p2.pagelet_serial   = p1.pagelet_serial
                  AND p2.pagelet_shingle <> p1.pagelet_shingle

That query would make full use of an index on (page_key, pagelet_serial) and should complete in tenth of seconds, not seconds.

If this was not what you were looking for, please show us what result you would expect if the values in your table were those: (1,2,3),(1,2,3),(1,1,3),(1,1,3),(1,2,4),(1,2,4),(1,1,4),(1,1,4)

Josh Davis
A: 

Have you tried using exists instead of in ?

Check this out: http://decipherinfosys.wordpress.com/2007/01/30/in-vs-exists/

Hope this helps

Pablo Cabrera