Hello

I'm in the process of setting up a system which will have to repeatedly parse large amounts of text (as a String or StringBuffer - which might be better?) acquired from a data source. The text will be displayed and may consist of several thousand words. Each time the text is parsed, each word may have to be checked against a list of 550 stop words, so that those words can be filtered from display.

So I wonder about performance, as this could be going on in multiple servlet sessions at any one time. Is it better to check each word against a MySQL database table (MyISAM or InnoDB) using an index? Or simply to store the 550 words in a Java array or ArrayList within the servlet context so they can possibly be read more quickly?

So I wonder about the trade-off between database I/O and storing 550 strings in memory.

Any advice?

Thanks

Mr Morgan.

+1  A: 

550 Strings is a very small amount of data for today's servers: you don't need the database, which will be much slower.

Jean-Philippe Caruana
I am amenable to this, especially as StringBuffers have the trimToSize method, so an array or list of these would be helpful.
Mr Morgan
+1  A: 

Assuming that the "data source" is not your database, you can get better performance by doing the stop-word search in memory rather than asking the database to do it. It stands to reason:

  • Any algorithm that the database uses can equally be used as your in-memory algorithm.
  • By running the algorithm locally, you avoid the cost of sending the text to the database and sending the results back.

It is also likely that you can implement a better algorithm for detecting the stop words than a general-purpose database engine could. And the memory needed for a data structure that represents the 550 or so stop words should be trivial compared with the space used by the rest of your application, the servlet container and all of the libraries that you use.
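
As a rough illustration of the in-memory approach, here is a minimal sketch that keeps the stop words in a HashSet and filters a block of text against it. The class name StopWordFilter, the whitespace-only tokenizing and the case-insensitive matching are all assumptions made for the example, not something from the question; real text would probably need punctuation handling as well.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: holds the stop words in memory and filters text against them.
public class StopWordFilter {
    // Populated once in the constructor and never modified afterwards,
    // so one instance can be read by many request threads.
    private final Set<String> stopWords;

    public StopWordFilter(Iterable<String> words) {
        Set<String> set = new HashSet<>();
        for (String w : words) {
            set.add(w.toLowerCase(Locale.ROOT));
        }
        this.stopWords = set;
    }

    // Returns the words of the text, in order, with the stop words dropped.
    // Assumption: words are separated by whitespace only.
    public List<String> filter(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token on leading whitespace
            }
            if (!stopWords.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

Each lookup in a HashSet is a constant-time hash probe on average, so filtering a few thousand words costs a few thousand probes and no network round trips. A plain array or ArrayList would also work, but each check would then scan up to 550 entries.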

Stephen C
The data source will initially be an RSS feed. Text extracted from it will be stored in a String and parsed against the stop words for display. But this will happen repeatedly. So I want to keep most of this processing in session memory. And the stop words in servlet context memory.
Mr Morgan
@Mr Morgan - that's fine. Doesn't change my answer.
Stephen C
@Stephen C: Very true. Thanks. Now it's just the listener to do the writing to servlet context.
Mr Morgan
A: 

I recommend using a standard Java Properties file, since you don't have that much data. This lets you use the standard Internationalization/Locale features.

This assumes, of course, that the copy changes fairly slowly. But that is usually the case.
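
One way this could look, as a sketch only: the stop words sit under a single key in a properties file and are read into a Set at startup. The key name stopwords, the comma-separated layout and the class name StopWordProperties are assumptions for illustration; a file per locale would need its own loading logic on top of this.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;

// Hypothetical loader: reads a comma-separated stop-word list from a properties source.
public class StopWordProperties {
    // Assumed file layout:  stopwords = the, a, of, ...
    public static Set<String> load(Reader reader) {
        Properties props = new Properties();
        try {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String raw = props.getProperty("stopwords", "");
        Set<String> words = new HashSet<>();
        for (String w : raw.split(",")) {
            String trimmed = w.trim();
            if (!trimmed.isEmpty()) {
                words.add(trimmed);
            }
        }
        return words;
    }
}
```

The resulting Set can be loaded once and shared, which fits the assumption that the list changes slowly.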

fishtoprecords