Hi,
I have a table that has 100.000 rows, and soon it will be doubled. The size of the database is currently 5 gb and most of them goes to one particular column, which is a text column for PDF files. We expect to have 20-30 GB or maybe 50 gb database after couple of month and this system will be used frequently.
I have couple of questions regarding with this setup
1-) We are using innodb on every table, including users table etc. Is it better to use myisam on this table, where we store text version of the PDF files? (from memory usage /performance perspective)
2-) We use Sphinx for searching, however the data must be retrieved for highlighting. Highlighting is done via sphinx API but still we need to retrieve 10 rows in order to send it to Sphinx again. This 10 rows may allocate 50 mb memory, which is quite large. So I am planning to split these PDF files into chunks of 5 pages in the database, so these 100.000 rows will be around 3-4 million rows and couple of month later, instead of having 300.000-350.000 rows, we'll have 10 million rows to store text version of these PDF files. However, we will retrieve less pages, so again instead of retrieving 400 pages to send Sphinx for highlighting, we can retrieve 5 pages and it will have a big impact on the performance. Currently, when we search a term and retrieve PDF files that have more than 100 pages, the execution time is 0.3-0.35 seconds, however if we retrieve PDF files that have less than 5 pages, the execution time reduces to 0.06 seconds, and it also uses less memory.
Do you think, this is a good trade-off? We will have million of rows instead of having 100k-200k rows but it will save memory and improve the performance. Is it a good approach to solve this problem and do you have any ideas how to overcome this problem?
The text version of the data is used only for indexing and highlighting. So, we are very flexible.
Edit: We store pdf files on our cloud, however for search highlighting, we need to retrieve the text version of the pdf file and give it to the Sphinx, Sphinx then returns the highlighted 256 character text. To index pdf files we need to insert them into the database because they also have additional metadata, like description tags and title and we need to link them for search engine. If we index txt files or pdf files from the file server, it is not possible to get other data from the db and link them to those txt files on the search engine. So, we still store PDF files on our cloud, but the text version must be in our db as well to index their tag title and description as well. They are different tables, but it must be in the database as well.
Thanks,