views: 431

answers: 4

There is a table:

doc_id(integer)-value(integer)

Approximately 100,000 doc_ids and 27,000,000 rows.

Most queries on this table search for documents similar to the current document:

select the 10 documents with the maximum of
     (count of values shared with the current document) / (count of values in the document).

Currently we use PostgreSQL. The table size (with index) is ~1.5 GB. Average query time is ~0.5 s, which is too high, and in my opinion this time will grow exponentially as the database grows.

Should I move all this to a NoSQL database, and if so, which one?

QUERY:

EXPLAIN ANALYZE
SELECT D.doc_id as doc_id,
  (count(D.doc_crc32) *1.0 / testing.get_count_by_doc_id(D.doc_id))::real as avg_doc 
FROM testing.text_attachment D
WHERE D.doc_id != 29758 -- 29758 is a random id
  AND D.doc_crc32 IN (select testing.get_crc32_rows_by_doc_id(29758)) -- get_crc32... is IMMUTABLE
GROUP BY D.doc_id
ORDER BY avg_doc DESC
LIMIT 10

Limit  (cost=95.23..95.26 rows=10 width=8) (actual time=1849.601..1849.641 rows=10 loops=1)
   ->  Sort  (cost=95.23..95.28 rows=20 width=8) (actual time=1849.597..1849.609 rows=10 loops=1)
         Sort Key: (((((count(d.doc_crc32))::numeric * 1.0) / (testing.get_count_by_doc_id(d.doc_id))::numeric))::real)
         Sort Method:  top-N heapsort  Memory: 25kB
         ->  HashAggregate  (cost=89.30..94.80 rows=20 width=8) (actual time=1211.835..1847.578 rows=876 loops=1)
               ->  Nested Loop  (cost=0.27..89.20 rows=20 width=8) (actual time=7.826..928.234 rows=167771 loops=1)
                     ->  HashAggregate  (cost=0.27..0.28 rows=1 width=4) (actual time=7.789..11.141 rows=1863 loops=1)
                           ->  Result  (cost=0.00..0.26 rows=1 width=0) (actual time=0.130..4.502 rows=1869 loops=1)
                     ->  Index Scan using crc32_idx on text_attachment d  (cost=0.00..88.67 rows=20 width=8) (actual time=0.022..0.236 rows=90 loops=1863)
                           Index Cond: (d.doc_crc32 = (testing.get_crc32_rows_by_doc_id(29758)))
                           Filter: (d.doc_id <> 29758)
 Total runtime: 1849.753 ms
(12 rows)
A: 

First, is 0.5 s a problem or not? And have you already optimized your queries, data model, and configuration settings? If not, you can still get better performance. Performance is a choice.

Besides speed, there is also functionality; that's what you will lose.

===

What about pushing the function into a JOIN:

EXPLAIN ANALYZE
SELECT 
    D.doc_id as doc_id,
    (count(D.doc_crc32) *1.0 / testing.get_count_by_doc_id(D.doc_id))::real as avg_doc 
FROM 
    testing.text_attachment D
        JOIN (SELECT testing.get_crc32_rows_by_doc_id(29758) AS r) AS crc ON D.doc_crc32 = r
WHERE 
    D.doc_id <> 29758
GROUP BY D.doc_id
ORDER BY avg_doc DESC
LIMIT 10
Frank Heikens
Yes, 0.5 s is a problem, because we expect a significant increase in the size of the table in the near future, so query time will grow too. Sure, the DB and queries have been optimized. There is no other functionality on this table except searching for similar documents.
potapuff
This query removes the "-> HashAggregate (cost=0.27..0.28 rows=1 width=4) (actual time=7.926..11.324 rows=1863 loops=1)" line, but saves only a few milliseconds.
potapuff
+3  A: 

1.5 GB is nothing. Serve it from RAM. Build a data structure that helps you search.
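For example, a minimal sketch of the relevant postgresql.conf settings (the values are only illustrative and depend on how much RAM the server actually has):

shared_buffers = 2GB          # let PostgreSQL cache the ~1.5 GB table and its index itself
effective_cache_size = 4GB    # planner hint: how much OS + PostgreSQL cache is available
work_mem = 32MB               # per-operation sort/hash memory, helps the HashAggregate step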

Stephan Eggermont
A: 

If you're getting performance that bad out of PostgreSQL, a good start would be to tune PostgreSQL, your query, and possibly your data model. A query like that should run a lot faster on such a small table.
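As a sketch only (the index name is made up, and whether it helps depends on your PostgreSQL version and data distribution), a composite index covering both columns the query touches, plus fresh statistics, would be one place to start:

CREATE INDEX text_attachment_crc32_doc_id_idx
    ON testing.text_attachment (doc_crc32, doc_id);
ANALYZE testing.text_attachment;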

Magnus Hagander
+1  A: 

I don't think your main problem here is the kind of database you're using, but the fact that you don't actually have an "index" for what you're searching for: similarity between documents.

My proposal is to determine once, for each of the 100,000 doc_ids, which 10 documents are most similar to it, and cache the result in a new table like this:

doc_id(integer)-similar_doc(integer)-score(integer)

where you insert 10 rows per document, each representing one of its 10 best matches. You'll get 1,000,000 rows, which you can access directly by index, which should take search time down to something like O(log n) (depending on the index implementation).
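A possible shape for that cache in SQL (a sketch only; the table name is illustrative, and score is assumed to be stored as an integer, e.g. a percentage):

CREATE TABLE testing.doc_similarity (
    doc_id      integer NOT NULL,
    similar_doc integer NOT NULL,
    score       integer NOT NULL,      -- e.g. the match ratio stored as a percentage
    PRIMARY KEY (doc_id, similar_doc)  -- also serves as the index for doc_id lookups
);

-- the similarity search then becomes a trivial indexed read
SELECT similar_doc, score
FROM testing.doc_similarity
WHERE doc_id = 29758
ORDER BY score DESC
LIMIT 10;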

Then, on each insertion or removal of a document (or one of its values) you iterate through the documents and update the new table accordingly.

e.g. when a new document is inserted, for each of the documents already in the table:

  1. you calculate its match score, and
  2. if the score is higher than the lowest score of the similar documents cached in the new table, you swap in the similar_doc and score of the newly inserted document (see the sketch after this list).
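A rough SQL sketch of that maintenance step, reusing the helper functions from the question and the doc_similarity table sketched above (29999 stands in for the id of the newly inserted document; the DELETE uses window functions, so it assumes PostgreSQL 8.4 or later):

-- 1. score the new document against every existing document
INSERT INTO testing.doc_similarity (doc_id, similar_doc, score)
SELECT d.doc_id,
       29999,
       (count(d.doc_crc32) * 100 / testing.get_count_by_doc_id(d.doc_id))::integer
FROM testing.text_attachment d
WHERE d.doc_id <> 29999
  AND d.doc_crc32 IN (SELECT testing.get_crc32_rows_by_doc_id(29999))
GROUP BY d.doc_id;

-- 2. trim every document back to its 10 best matches
DELETE FROM testing.doc_similarity
WHERE (doc_id, similar_doc) IN (
    SELECT doc_id, similar_doc
    FROM (SELECT doc_id, similar_doc,
                 row_number() OVER (PARTITION BY doc_id ORDER BY score DESC) AS rn
          FROM testing.doc_similarity) ranked
    WHERE rn > 10
);

-- the new document's own 10 best matches would be filled in the same way,
-- with the roles of doc_id and similar_doc swapped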
Utaal