To add full text search to my App Engine app I've added the following field to my model:

private List<String> fullText;

To test the search, I took the following text:

Oxandrolone is a synthetic anabolic steroid derived from dihydrotestosterone  by substituting 2nd carbon atom for oxygen (O). It is widely known for its exceptionally small level of androgenicity accompanied by moderate anabolic effect. Although oxandrolone is a 17-alpha alkylated steroid, its liver toxicity is very small as well. Studies have showed that a daily dose of 20 mg oxandrolone used in the course of 12 weeks had only a negligible impact on the increase of liver enzymes[1][2]. As a DHT derivative, oxandrolone does not aromatize (convert to estrogen, which causes gynecomastia  or male breast tissue). It also does not significantly influence the body's normal testosterone production (HPTA axis) at low dosages (10 mg). When dosages are high, the human body reacts by reducing the production of LH (luteinizing hormone), thinking endogenous testosterone production is too high; this in turn eliminates further stimulation of Leydig cells in the testicles, causing testicular atrophy (shrinking). Oxandrolone used in a dose of 80 mg/day suppressed endogenous testosterone by 67% after 12 weeks of therapy[3].

And applied this Java code to it:

 StringTokenizer st = new StringTokenizer(recordText);
 List<String> fullTextSearchSupport = new ArrayList<String>();
 while (st.hasMoreTokens())
 {
  String token = st.nextToken().trim();
  if (token.length() > 3)
  {
   fullTextSearchSupport.add(token);
  }
 }

I got back the following ArrayList of String tokens:

[Oxandrolone, synthetic, anabolic, steroid, derived, from, dihydrotestosterone, substituting, carbon, atom, oxygen, (O)., widely, known, exceptionally, small, level, androgenicity, accompanied, moderate, anabolic, effect., Although, oxandrolone, 17-alpha, alkylated, steroid,, liver, toxicity, very, small, well., Studies, have, showed, that, daily, dose, oxandrolone, used, course, weeks, only, negligible, impact, increase, liver, enzymes[1][2]., derivative,, oxandrolone, does, aromatize, (convert, estrogen,, which, causes, gynecomastia, male, breast, tissue)., also, does, significantly, influence, body's, normal, testosterone, production, (HPTA, axis), dosages, mg)., When, dosages, high,, human, body, reacts, reducing, production, (luteinizing, hormone),, thinking, endogenous, testosterone, production, high;, this, turn, eliminates, further, stimulation, Leydig, cells, testicles,, causing, testicular, atrophy, (shrinking)., Oxandrolone, used, dose, mg/day, suppressed, endogenous, testosterone, after, weeks, therapy[3].]

What surprised me is that the StringTokenizer leaves in punctuation such as commas, periods, brackets and parentheses when breaking up the String into tokens.

For example, for a text search, the token:

derivative,

could simply be

derivative

and

enzymes[1][2].

could simply be:

enzymes

What's the best way to produce only English word output that would be needed in a text search, excluding punctuation and special characters?

I tried to filter out short joining words (a, by, for) with this condition:

token.length() > 3

but obviously that is not enough.

+1  A: 

If you expect the set of delimiter characters to stay fixed, you can do something simple, if a bit silly, like listing them all:

new StringTokenizer(v, " .,?!:;()\b\t\n\f\r\"\'");

Or you could do a search and replace on every character whose value falls outside 65-90 and 97-122 (the ASCII codes for A-Z and a-z).
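
A minimal sketch of that search-and-replace idea, assuming it is acceptable to treat every non-letter as a separator. The class name LetterOnlyTokenizer and the minLength parameter are made up for illustration and are not from the question's code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original question's code.
class LetterOnlyTokenizer {

    // Replaces every character outside A-Z / a-z (ASCII 65-90 and 97-122)
    // with a space, splits on runs of whitespace, and keeps only tokens
    // longer than minLength characters.
    static List<String> tokenize(String text, int minLength) {
        List<String> tokens = new ArrayList<String>();
        String lettersOnly = text.replaceAll("[^A-Za-z]", " ");
        for (String token : lettersOnly.trim().split("\\s+")) {
            if (token.length() > minLength) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [derivative, liver, enzymes]
        System.out.println(tokenize("As a DHT derivative, liver enzymes[1][2].", 3));
    }
}
```

One trade-off to be aware of: digits, hyphens and apostrophes are treated as separators too, so "17-alpha" is reduced to "alpha" and "body's" to "body".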

tathamr
+2  A: 

Yes, the default delimiters are whitespace characters, but you can specify your own using the two-argument constructor:

StringTokenizer st = new StringTokenizer(recordText, ".,! ()[]");
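
As a rough sketch, here is how that constructor could slot into the loop from the question. The delimiter set below is an assumption (whitespace plus common punctuation) and would need extending for other symbols. The class name DelimiterTokenizer is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical wrapper around the loop from the question.
class DelimiterTokenizer {

    // Assumed delimiter set: whitespace characters plus common punctuation.
    // Whitespace must be listed explicitly, because a custom delimiter
    // string replaces the defaults instead of adding to them.
    static final String DELIMITERS = " \t\n\r\f.,;:!?()[]";

    static List<String> tokenize(String recordText) {
        StringTokenizer st = new StringTokenizer(recordText, DELIMITERS);
        List<String> fullTextSearchSupport = new ArrayList<String>();
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            // Same length filter as in the question; it also drops the
            // citation digits left over from tokens like "enzymes[1][2]."
            if (token.length() > 3) {
                fullTextSearchSupport.add(token);
            }
        }
        return fullTextSearchSupport;
    }

    public static void main(String[] args) {
        // prints [derivative, oxandrolone, does, convert, estrogen, enzymes]
        System.out.println(tokenize("derivative, oxandrolone does (convert to estrogen, enzymes[1][2]."));
    }
}
```

Because every delimiter character is discarded, no trim() call is needed on the tokens.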
Alan Moore