Java: Parsing a text for words from a list (acronyms, abbr., etc)

tags:

java
parsing


Given:

A text (optionally with HTML tags)
A database table with abbreviations and acronyms (like "etc.", "s.o.", ...)

Goals:

Build a parser that finds all occurrences in the given text
Build a small GUI to let the user decide whether each found occurrence is a real match (this will be Swing, as required)
User has the option to ignore a match (must also be marked as "to be ignored")
Replace any accepted occurrence with a special XML construct

My main problem is the parser; I mention the GUI only to give a complete overview.

The task is to build a parser that analyzes the text for, for example, an acronym and marks it for later postprocessing. Any "mark" must be in the form of XML tags, as the surrounding environment does not accept anything else (we are in a DOM editor of a CMS whose name ends with "Spirit" ;) ).

Does anybody have a hint for a library, or has anybody built something like this? How did you, or would you, handle things like:

Two or more words that form one entity
Full stop: is it part of the sentence or part of the token you are looking for?
Iterative replacement: when the user accepts the first occurrence, replace it instantly or buffer the change?

Any idea, library hint, Wikipedia article, whatever - is helpful. I didn't find any related question that answered all of the aspects mentioned above.

+2 A:

I've read many good things about Apache Lucene, and I'd look at it first if I had a similar project. It can index the source document and help to find all occurrences of your acronyms (that's what you want as a result from the 'parsing' step, if I got it right).

Andreas_D 2010-08-12 06:56:08

Yes, this is the goal of the parsing step, but I have to mark/replace the occurrences step by step. This means the first occurrence of "e.x." may be left untouched while the second occurrence is replaced. This depends on the choice of the user, who can click a checkbox beside each hit. I do not know Lucene that well; is it possible to highlight the occurrences and get their positions in the text afterwards?

Mario Mueller 2010-08-12 07:43:47

Lucene is a search engine. It does not touch the text (highlighting) but creates a word index. Then you can use that index to get the positions of the search results and you can use those positions (and lengths) to apply some highlighting/tagging to the source text.

Andreas_D 2010-08-12 08:19:25
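
The position-based tagging described in the comment above could look roughly like the following. This is only a minimal sketch: it does not use Lucene at all, and the `Hit` type, the `wrap` method and the `abbr` tag name are hypothetical names chosen for illustration. It assumes the accepted hits do not overlap and that offsets refer to the original text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OffsetTagger {

    /** A hit in the source text: start offset and length (hypothetical type). */
    static final class Hit {
        final int start;
        final int length;

        Hit(int start, int length) {
            this.start = start;
            this.length = length;
        }
    }

    /**
     * Wraps every hit in the given tag. Hits are processed from the last one
     * to the first, so inserting tags never shifts an offset that is still
     * to be processed. Assumes the hits do not overlap.
     */
    static String wrap(String text, List<Hit> hits, String tag) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort((a, b) -> Integer.compare(b.start, a.start)); // descending by start
        StringBuilder out = new StringBuilder(text);
        for (Hit h : sorted) {
            out.insert(h.start + h.length, "</" + tag + ">"); // closing tag first
            out.insert(h.start, "<" + tag + ">");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "See s.o. for details, etc.";
        // Offsets of the hits the user accepted; in practice they would come from the search step.
        List<Hit> accepted = Arrays.asList(new Hit(4, 4), new Hit(22, 4));
        System.out.println(wrap(text, accepted, "abbr"));
        // prints: See <abbr>s.o.</abbr> for details, <abbr>etc.</abbr>
    }
}
```

Working backwards also gives one possible answer to the "instant replace or buffering" question: collect the user's decisions first, then apply all accepted hits in one pass.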

+1 Andreas_D: that's a really nice lib

LB 2010-08-12 09:14:53

+1 A:

Use a SAX parser of some sort that runs on the input. For every hit you pause the parsing, show it in the GUI and let the user choose what to do. While parsing, you build a DOM tree in the background.

Every time the user replaces something, you replace the given element in that DOM tree (you know which one it is, since you're holding the element that the user needs to react to).

When the whole thing is parsed and replaced you simply print out the DOM tree.

Jes 2010-08-12 08:04:47
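
A rough sketch of the tree-based replacement described in this answer, under some loud assumptions: it parses straight into a DOM with the JDK's `DocumentBuilder` instead of driving a SAX parser, it searches for a single literal token, and the `IntPredicate` stands in for the GUI decision (it receives the running hit number). `DomMarker`, `mark` and the `abbr` tag are hypothetical names.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.function.IntPredicate;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

final class DomMarker {

    private int hitCount = 0; // running number of hits, passed to the decision callback

    /** Parses the XML, wraps accepted occurrences of token in the tag, and serializes the tree. */
    static String mark(String xml, String token, String tag, IntPredicate accept) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            new DomMarker().walk(doc.getDocumentElement(), token, tag, accept);
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Sketch only: checked parser/transformer exceptions are wrapped for brevity.
            throw new IllegalStateException(e);
        }
    }

    private void walk(Node node, String token, String tag, IntPredicate accept) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before the tree is modified
            if (child.getNodeType() == Node.TEXT_NODE) {
                markText((Text) child, token, tag, accept);
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                walk(child, token, tag, accept);
            }
            child = next;
        }
    }

    private void markText(Text text, String token, String tag, IntPredicate accept) {
        int from = 0;
        int at;
        while ((at = text.getData().indexOf(token, from)) >= 0) {
            if (!accept.test(hitCount++)) {
                from = at + token.length(); // ignored by the user: keep searching
                continue;
            }
            Text hit = text.splitText(at);             // text | hit + rest
            Text rest = hit.splitText(token.length()); // text | hit | rest
            Element wrapper = text.getOwnerDocument().createElement(tag);
            hit.getParentNode().replaceChild(wrapper, hit);
            wrapper.appendChild(hit);
            text = rest; // continue in the remainder of the original text
            from = 0;
        }
    }

    public static void main(String[] args) {
        // Accept only the second hit (index 1), leaving the first one untouched.
        System.out.println(mark("<p>e.x. one and e.x. two</p>", "e.x.", "abbr", i -> i == 1));
    }
}
```

Because each accepted hit is split out of its text node and wrapped immediately, an "instant replace" never invalidates later hits: the search simply continues in the remaining text node.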

A SAX parser is a good direction, but the OP needs to find acronyms. The parser will report a text (CDATA?) element for anything between tags, but we still need to parse the content of that chunk to find the acronyms.

Andreas_D 2010-08-12 08:22:11
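
For texts as small as a single chunk of character data, scanning the chunk for the listed abbreviations could be done with a plain regular expression. The sketch below is one possible approach, not a library recommendation; `AbbreviationFinder` is a hypothetical name, and the list is assumed to be non-empty and already loaded from the database table. Multi-word entries are treated as one entity whose words may be separated by any whitespace, and a trailing full stop that belongs to the entry (as in "etc.") is part of the match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class AbbreviationFinder {

    private final Pattern pattern;

    /** Assumes a non-empty list of abbreviations. */
    AbbreviationFinder(List<String> abbreviations) {
        String alternation = abbreviations.stream()
                // Longest entries first, so a longer entry wins over a shorter prefix.
                .sorted((a, b) -> Integer.compare(b.length(), a.length()))
                .map(AbbreviationFinder::toRegex)
                .collect(Collectors.joining("|"));
        // Lookarounds: a hit must not be glued to a letter or digit on either side.
        this.pattern = Pattern.compile(
                "(?<![\\p{L}\\p{N}])(?:" + alternation + ")(?![\\p{L}\\p{N}])");
    }

    /** Quotes each word literally and allows any whitespace between words. */
    private static String toRegex(String entry) {
        return Arrays.stream(entry.trim().split("\\s+"))
                .map(Pattern::quote)
                .collect(Collectors.joining("\\s+"));
    }

    /** Returns {start, end} offsets (end exclusive) of every hit, in text order. */
    List<int[]> find(String text) {
        List<int[]> hits = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            hits.add(new int[] { m.start(), m.end() });
        }
        return hits;
    }

    public static void main(String[] args) {
        AbbreviationFinder finder =
                new AbbreviationFinder(Arrays.asList("etc.", "s.o.", "et al."));
        String text = "See s.o. for details, etc.";
        for (int[] hit : finder.find(text)) {
            System.out.println(hit[0] + "-" + hit[1] + ": " + text.substring(hit[0], hit[1]));
        }
        // prints:
        // 4-8: s.o.
        // 22-26: etc.
    }
}
```

Whether the final "." in the example also ends the sentence cannot be decided by the pattern alone; that is the kind of case the user confirmation step would cover.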

And that's where you could use Lucene or something similar, and manipulate the data of the element on the fly. Lucene is great for searching once configured, and should work for this application too. The indexing part of it may be overkill, but I don't know the size of the text data retrieved.

Jes 2010-08-12 09:10:24

@Jes: from 5 to 500 words, rarely more than 500, but always less than 1000.

Mario Mueller 2010-08-12 10:09:45

@Mario then Lucene is the way to go, otherwise you will need to implement it yourself, and that's just a waste of work :)

Jes 2010-08-12 11:20:49
