views: 474
answers: 10

Ok, I have a need to perform some intensive text manipulation operations.

Like concatenating huge texts (say, 100 pages of standard text), searching through them, and so on. So I am wondering: would MySQL give me better performance for these specific operations than a C program doing the same thing?

Thanks.

+5  A: 

Any database is always slower than a flat-file program outside the database.

A database server has overheads that a program reading and writing simple files doesn't have.

S.Lott
And what about indices? They can speed up some operations tremendously! In a way that any flat-file-handling program can only dream of!
Joachim Sauer
@saua: indices make SLOW database operations less slow. A C program still has less overhead than a database application.
S.Lott
+3  A: 

In general the database will be slower. But much depends on the type of processing you want to do, the time you can devote to coding, and your coding skills. If the database provides the tools and functionality you need out of the box, why not give it a try? That should take much less time than coding your own tool. If the performance turns out to be an issue, then write your own solution.

But I think that MySQL will not provide the text manipulation operations you want. In the Oracle world there are Text Mining and Oracle Text.

Anonymous
+1  A: 

Relational databases are normally not good at handling large text data. The performance strength of relational DBs lies in indexing and automatically generated query plans. Free-form text does not fit well with this model.

Arkadiy
A: 

If you're literally talking about concatenating strings and doing a regexp match, it sounds like something that's worth doing in C/C++ (or Java or C# or whatever your favorite fast high-level language is).

Databases are going to give you other features like persistence, transactions, complicated queries, etc.

Jason S
A: 

MySQL is written in C, so it is not correct to compare it to a C program: it is itself a C program.

dmityugov
He's asking about building his own C program specifically for this task and whether that would be more efficient than using MySQL.
Graeme Perrow
Hi Graeme, "a C program doing the same thing" is just too broad a term. Provided you have unlimited time and resources, a C program doing the same thing will certainly be much faster than MySQL. However, if your time and resources are limited, reusing an existing solution (MySQL) can be faster.
dmityugov
+1  A: 

If you are talking about storing plain text in one DB field and trying to manipulate that data, then C/C++ should be the faster solution. Put simply, MySQL is a much bigger C program than yours would be, so it must be slower at simple tasks like string manipulation :-)

Of course, you must use the correct algorithm to get good results. There is a useful e-book about string search algorithms, with examples included: http://www-igm.univ-mlv.fr/~lecroq/string/index.html

P.S. Benchmark it and give us a report :-)

Pawka
A: 

Thanks for all the answers.

I kind of thought that a DB would involve some overhead as well. But what I was thinking is this: since my application requires the text to be stored somewhere in the first place anyway, wouldn't the entire process of extracting the text from the DB, passing it to the C program, and writing the result back into the DB be less efficient overall than processing it within the DB?

A: 

With MySQL you can take advantage of full-text indexes, which can be hundreds of times faster than searching through the text directly.

vartec
A: 

MySQL is fairly efficient. You need to consider whether writing your own C program would mean more or fewer records need to be accessed to get the final result, and whether more or less data needs to be transferred over the network to get it.

If either solution results in the same number of records being accessed and the same amount of data transferred over the network, then there probably won't be a big difference either way. If performance is critical, then try both and benchmark them (if you don't have time to benchmark both, then you probably want to go for whichever is easiest to implement anyway).

Mark Baker
+1  A: 

There are several good responses that I voted up, but here are some more considerations, in my opinion:

No matter what path you take: indexing the text is critical for speed. There's no way around it. The only choice is how complex to make your index, trading off space constraints against search-query features. For example, a simple B-tree structure is fast and easy to implement but will use more disk space than a trie structure.

Unless you really understand all the issues, or want to do this as a learning exercise, you are going to be much better off using an application that has had years of performance tuning.

That can mean a relational database like MySQL, even though full-text search is a kludge in databases designed for tables of rows and columns. For MySQL, use the MyISAM engine to do the indexing and add a full-text index on a "blob" column. (AFAIK, the InnoDB engine still doesn't handle full-text indexing, so you need to use MyISAM.) For PostgreSQL you can use tsearch.
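For concreteness, the MyISAM setup looks roughly like this (a sketch with a hypothetical `docs` table, not the asker's schema):

```sql
-- MyISAM table with a full-text index on the text column
CREATE TABLE docs (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    body TEXT,
    FULLTEXT INDEX ft_body (body)
) ENGINE=MyISAM;

-- Natural-language full-text search instead of a slow LIKE '%...%' scan
SELECT id FROM docs
WHERE MATCH(body) AGAINST('string manipulation');
```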

Though it takes a bit more implementation effort, you'll see the best performance by integrating indexing applications like Xapian, Hyper Estraier or (maybe) Lucene into your C program.

Besides better performance, these apps will also give you important features that MySQL full-text searching is missing, such as word stemming, phrase searching, etc.; in other words, real full-text query parsers that aren't limited to an SQL mindset.

Van Gale