Hi,

I have a table with 100,000 rows, and soon that will double. The database is currently 5 GB, and most of that goes to one particular column: a TEXT column holding the extracted text of PDF files. We expect the database to reach 20-30 GB, maybe 50 GB, within a couple of months, and the system will be used frequently.

I have a couple of questions regarding this setup:

1-) We are using InnoDB on every table, including the users table etc. Is it better to use MyISAM for this table, where we store the text version of the PDF files (from a memory usage/performance perspective)?

2-) We use Sphinx for searching, but the data must be retrieved from MySQL for highlighting. Highlighting is done via the Sphinx API, but we still need to retrieve 10 rows to send back to Sphinx. These 10 rows may take up 50 MB of memory, which is quite a lot. So I am planning to split the PDF text into chunks of 5 pages each in the database. These 100,000 rows would then become around 3-4 million rows, and a couple of months later, instead of 300,000-350,000 rows, we'd have about 10 million rows storing the text version of the PDF files. However, we would retrieve far fewer pages per request: instead of retrieving 400 pages to send to Sphinx for highlighting, we could retrieve 5 pages, which has a big impact on performance. Currently, when we search for a term and retrieve PDF files that have more than 100 pages, the execution time is 0.3-0.35 seconds, but if we retrieve PDF files that have fewer than 5 pages, the execution time drops to 0.06 seconds, and it also uses less memory.
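To make the idea concrete, the chunked layout I have in mind would look roughly like this (table and column names are just placeholders, and the engine is still the open question from 1-):

    -- One row per 5-page slice of a PDF's extracted text.
    CREATE TABLE pdf_text_chunks (
        id         INT UNSIGNED NOT NULL AUTO_INCREMENT,
        pdf_id     INT UNSIGNED NOT NULL,      -- row id of the PDF's metadata
        first_page SMALLINT UNSIGNED NOT NULL, -- first page covered by this chunk
        last_page  SMALLINT UNSIGNED NOT NULL, -- last page covered by this chunk
        body       MEDIUMTEXT NOT NULL,        -- text of pages first_page..last_page
        PRIMARY KEY (id),
        KEY idx_pdf_pages (pdf_id, first_page)
    ) ENGINE=InnoDB;

    -- Highlighting would then fetch only the chunks that matched,
    -- e.g. the chunk containing page 12 of document 42:
    SELECT body
    FROM pdf_text_chunks
    WHERE pdf_id = 42
      AND first_page <= 12
      AND last_page  >= 12;

The point is that a search hit would pull 5 pages of text out of MySQL instead of the whole document.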

Do you think this is a good trade-off? We would have millions of rows instead of 100k-200k, but it would save memory and improve performance. Is this a good approach, and do you have any other ideas on how to overcome this problem?

The text version of the data is used only for indexing and highlighting. So, we are very flexible.

Edit: We store the PDF files on our cloud, but for search highlighting we need to retrieve the text version of the PDF and give it to Sphinx; Sphinx then returns the highlighted 256-character snippet. To index the PDF files we need to insert their text into the database, because they also have additional metadata, like description, tags and title, and we need to link these for the search engine. If we indexed txt or PDF files straight from the file server, it would not be possible to get the other data from the db and link it to those files in the search engine. So we still store the PDF files on our cloud, but the text version must be in our db as well so that we can index the title, description and tags with it. They are different tables, but it all has to be in the database.
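For reference, the indexing query that feeds Sphinx could simply join the text table with the metadata table, so everything stays linked (using the placeholder names from the sketch above; a query along these lines would go into the source's sql_query):

    -- First column is the Sphinx document id; pdf_id can be declared as an
    -- attribute (sql_attr_uint) so hits can be linked back to the PDF row.
    SELECT c.id, c.pdf_id, c.body,
           m.title, m.description, m.tags
    FROM pdf_text_chunks c
    JOIN pdf_metadata m ON m.id = c.pdf_id;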

Thanks,

A: 

It sounds like you don't really need to retrieve the entire PDF text every time you hit a row for that PDF file.

Are you separating the metadata about your PDF files from the file text itself? You definitely shouldn't have just one table here. You might want something like a pdf_info table with your 100 columns (do you really have that much metadata? why 100 columns?) and a foreign key to a pdf_files table containing the actual text of the files. Then you can experiment with, say, making the info table InnoDB and the files table MyISAM.
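As a rough sketch (column names and types are just guesses at what you actually have):

    -- Small, frequently queried metadata stays on InnoDB.
    CREATE TABLE pdf_info (
        id          INT UNSIGNED NOT NULL AUTO_INCREMENT,
        title       VARCHAR(255) NOT NULL,
        description TEXT,
        -- ... the rest of your metadata columns ...
        PRIMARY KEY (id)
    ) ENGINE=InnoDB;

    -- The bulky extracted text lives in its own table, which you can
    -- switch between engines without touching the metadata.
    CREATE TABLE pdf_files (
        pdf_id INT UNSIGNED NOT NULL, -- matches pdf_info.id
        body   LONGTEXT NOT NULL,     -- full extracted text of the PDF
        PRIMARY KEY (pdf_id)
    ) ENGINE=MyISAM;

Keep in mind that MyISAM doesn't enforce FOREIGN KEY constraints, so if the files table ends up on MyISAM the pdf_id link is only by convention.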

IMHO: there are many, many reasons NOT to store the PDF files themselves in the MySQL database. I would just store file paths pointing to a SAN or some other file distribution mechanism. SQL will happily store almost any kind of data, and files certainly fall into that category, but file systems are specifically designed to store files, and web servers are specifically designed to deliver those files to you as quickly as possible. So... just something to think about.

Igor
A: 

That sounds like a really bad technology choice. If you can slow the growth enough to keep everything in memory (affordable up to 128 GB or so), or partition for a larger size, you can basically be limited only by network transfer.

[edit] If the PDFs are on disk and not in RAM, the disk needs to be accessed. If you don't have an SSD, you can do that about 50 times per second per disk. As long as a PDF is smaller than a disk track, splitting it is not very interesting. If you split the PDFs and then need access to all the parts, you might have to read from several tracks, slowing you down a lot.

Handling large documents with an RDBMS in a multi-user setup is not a good idea, performance-wise.

Stephan Eggermont
If it's a bad choice, do you have any suggestions?
murat
Oh sorry, you mean splitting is a bad choice. Yes, if we add enough RAM to our DB server and web server there won't be any problem, but the data may grow very large in the long term, so if there is an elegant way to solve this problem I'd like to hear it. Otherwise, if splitting won't solve this problem, we'll buy a lot of RAM.
murat
A: 

Use Solr; it is possible to index text files together with their metadata from a database. I have switched the search engine to Solr.

murat