views: 431

answers: 4

There is a table:

doc_id(integer)-value(integer)

Approximately 100,000 doc_ids and 27,000,000 rows.

Most queries on this table search for documents similar to the current document:

select the 10 documents with the maximum of
     (count of values shared with the current document) / (count of values in the document).

Currently we use PostgreSQL. The table size (with index) is ~1.5 GB. Average query time is ~0.5 s, which is too high, and in my opinion this time will grow exponentially as the database grows.

Should I move all this to a NoSQL database, and if so, which one?

QUERY:

EXPLAIN ANALYZE
SELECT D.doc_id as doc_id,
  (count(D.doc_crc32) *1.0 / testing.get_count_by_doc_id(D.doc_id))::real as avg_doc 
FROM testing.text_attachment D
WHERE D.doc_id != 29758 -- 29758 is a random id
  AND D.doc_crc32 IN (select testing.get_crc32_rows_by_doc_id(29758)) -- get_crc32... is IMMUTABLE
GROUP BY D.doc_id
ORDER BY avg_doc DESC
LIMIT 10

Limit  (cost=95.23..95.26 rows=10 width=8) (actual time=1849.601..1849.641 rows=10 loops=1)
   ->  Sort  (cost=95.23..95.28 rows=20 width=8) (actual time=1849.597..1849.609 rows=10 loops=1)
         Sort Key: (((((count(d.doc_crc32))::numeric * 1.0) / (testing.get_count_by_doc_id(d.doc_id))::numeric))::real)
         Sort Method:  top-N heapsort  Memory: 25kB
         ->  HashAggregate  (cost=89.30..94.80 rows=20 width=8) (actual time=1211.835..1847.578 rows=876 loops=1)
               ->  Nested Loop  (cost=0.27..89.20 rows=20 width=8) (actual time=7.826..928.234 rows=167771 loops=1)
                     ->  HashAggregate  (cost=0.27..0.28 rows=1 width=4) (actual time=7.789..11.141 rows=1863 loops=1)
                           ->  Result  (cost=0.00..0.26 rows=1 width=0) (actual time=0.130..4.502 rows=1869 loops=1)
                     ->  Index Scan using crc32_idx on text_attachment d  (cost=0.00..88.67 rows=20 width=8) (actual time=0.022..0.236 rows=90 loops=1863)
                           Index Cond: (d.doc_crc32 = (testing.get_crc32_rows_by_doc_id(29758)))
                           Filter: (d.doc_id <> 29758)
 Total runtime: 1849.753 ms
(12 rows)
A: 

First, is 0.5 s a problem or not? And have you already optimized your queries, data model, and configuration settings? If not, you can still get better performance. Performance is a choice.

Besides speed, there is also functionality; that's what you will lose.

===

What about pushing the function into a JOIN:

EXPLAIN ANALYZE
SELECT 
    D.doc_id as doc_id,
    (count(D.doc_crc32) *1.0 / testing.get_count_by_doc_id(D.doc_id))::real as avg_doc 
FROM 
    testing.text_attachment D
        JOIN (SELECT testing.get_crc32_rows_by_doc_id(29758) AS r) AS crc ON D.doc_crc32 = r
WHERE 
    D.doc_id <> 29758
GROUP BY D.doc_id
ORDER BY avg_doc DESC
LIMIT 10
Frank Heikens
Yes, 0.5 s is a problem, because we expect a significant increase in the size of the table in the near future, so query time will grow too. Sure, the DB and queries have been optimized. There is no other functionality on this table except searching for similar documents.
potapuff
This query removes the "-> HashAggregate (cost=0.27..0.28 rows=1 width=4) (actual time=7.926..11.324 rows=1863 loops=1)" line, but saves only a few milliseconds.
potapuff
+3  A: 

1.5 GB is nothing. Serve it from RAM. Build a data structure that helps you search.
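For example, a minimal sketch of the relevant postgresql.conf settings (the values are only illustrative and depend on how much RAM the server actually has):

shared_buffers = 2GB          # let PostgreSQL cache the ~1.5 GB table and its index itself
effective_cache_size = 4GB    # planner hint: how much OS + PostgreSQL cache is available
work_mem = 32MB               # per-operation sort/hash memory, helps the HashAggregate step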

Stephan Eggermont
A: 

If you're getting performance that bad out of PostgreSQL, a good start would be to tune PostgreSQL, your query, and possibly your data model. A query like that should run a lot faster on such a small table.
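As a sketch only (the index name is made up, and whether it helps depends on your PostgreSQL version and data distribution), a composite index covering both columns the query touches, plus fresh statistics, would be one place to start:

CREATE INDEX text_attachment_crc32_doc_id_idx
    ON testing.text_attachment (doc_crc32, doc_id);
ANALYZE testing.text_attachment;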

Magnus Hagander
+1  A: 

I don't think your main problem here is the kind of database you're using, but the fact that you don't actually have an "index" for what you're searching for: similarity between documents.

My proposal is to determine once, for each of the 100,000 doc_ids, which 10 documents are most similar to it, and cache the result in a new table like this:

doc_id(integer)-similar_doc(integer)-score(integer)

where you insert 10 rows per document, each representing one of its 10 best matches. You'll get 1,000,000 rows, which you can access directly by index, which should take search time down to something like O(log n) (depending on the index implementation).
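A possible shape for that cache in SQL (a sketch only; the table name is illustrative, and score is assumed to be stored as an integer, e.g. a percentage):

CREATE TABLE testing.doc_similarity (
    doc_id      integer NOT NULL,
    similar_doc integer NOT NULL,
    score       integer NOT NULL,      -- e.g. the match ratio stored as a percentage
    PRIMARY KEY (doc_id, similar_doc)  -- also serves as the index for doc_id lookups
);

-- the similarity search then becomes a trivial indexed read
SELECT similar_doc, score
FROM testing.doc_similarity
WHERE doc_id = 29758
ORDER BY score DESC
LIMIT 10;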

Then, on each insertion or removal of a document (or one of its values) you iterate through the documents and update the new table accordingly.

e.g. when a new document is inserted, for each of the documents already in the table:

  1. you calculate its match score, and
  2. if the score is higher than the lowest score of the similar documents cached in the new table, you swap in the similar_doc and score of the newly inserted document (see the sketch after this list).
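A rough SQL sketch of that maintenance step, reusing the helper functions from the question and the doc_similarity table sketched above (29999 stands in for the id of the newly inserted document; the DELETE uses window functions, so it assumes PostgreSQL 8.4 or later):

-- 1. score the new document against every existing document
INSERT INTO testing.doc_similarity (doc_id, similar_doc, score)
SELECT d.doc_id,
       29999,
       (count(d.doc_crc32) * 100 / testing.get_count_by_doc_id(d.doc_id))::integer
FROM testing.text_attachment d
WHERE d.doc_id <> 29999
  AND d.doc_crc32 IN (SELECT testing.get_crc32_rows_by_doc_id(29999))
GROUP BY d.doc_id;

-- 2. trim every document back to its 10 best matches
DELETE FROM testing.doc_similarity
WHERE (doc_id, similar_doc) IN (
    SELECT doc_id, similar_doc
    FROM (SELECT doc_id, similar_doc,
                 row_number() OVER (PARTITION BY doc_id ORDER BY score DESC) AS rn
          FROM testing.doc_similarity) ranked
    WHERE rn > 10
);

-- the new document's own 10 best matches would be filled in the same way,
-- with the roles of doc_id and similar_doc swapped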
Utaal