Given:
- A text (optionally containing HTML tags)
- A database table with abbreviations and acronyms (like "etc.", "s.o.", ...)
Goals:
- Build a parser that finds all occurrences of these abbreviations and acronyms in the given text
- Build a small GUI that lets the user decide whether each found occurrence really is a match (this has to be Swing, as required)
- The user has the option to ignore a match (an ignored match must also be marked, as "to be ignored")
- Replace any accepted occurrence with a special XML construct
My main problem is the parser; I've mentioned the GUI only to give a complete overview.
The task is to build a parser that analyzes the text for, e.g., an acronym and marks it for later postprocessing. Any "mark" must be in the form of XML tags, as the surrounding environment does not accept anything else (we are in a DOM editor of a CMS that ends with "Spirit" ;) ).
Does anybody have a hint for a library, or has anybody built something like this? How did you, or how would you, handle things like:
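To make the marking step concrete, here is a minimal sketch of the idea, not a finished solution. It assumes the abbreviations have already been loaded from the database into a list, and the class name `AbbreviationMarker` and the tag name `abbr-candidate` are made up for illustration (the CMS will dictate its own construct). A single regular expression matches either a whole HTML tag, which is copied through untouched, or one of the known abbreviations, which gets wrapped.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class AbbreviationMarker {

    // Hypothetical tag name, chosen only for this sketch.
    static final String TAG = "abbr-candidate";

    /** Builds one pattern that matches either an HTML tag or any known abbreviation. */
    static Pattern compile(List<String> abbreviations) {
        if (abbreviations.isEmpty()) {
            throw new IllegalArgumentException("need at least one abbreviation");
        }
        List<String> sorted = new ArrayList<>(abbreviations);
        // Longest first, so a longer entry wins over a shorter one that is its prefix.
        sorted.sort(Comparator.comparingInt(String::length).reversed());
        String alternation = sorted.stream()
                .map(Pattern::quote) // dots etc. are taken literally
                .collect(Collectors.joining("|"));
        // Alternative 1: a whole tag, consumed so that its content is never searched.
        // Alternative 2 (group 1): an abbreviation not glued to a letter or digit.
        return Pattern.compile(
                "<[^>]*>|(?<![\\p{L}\\p{N}])(" + alternation + ")(?![\\p{L}\\p{N}])");
    }

    /** Wraps every abbreviation found outside of tags in the marker element. */
    static String mark(String text, Pattern pattern) {
        Matcher m = pattern.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start());
            if (m.group(1) == null) {
                out.append(m.group()); // an HTML tag: copy unchanged
            } else {
                out.append('<').append(TAG).append('>')
                   .append(m.group(1))
                   .append("</").append(TAG).append('>');
            }
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }
}
```

With the entries "etc." and "s.o.", calling `mark` on `<p title="etc.">See s.o. and etc.</p>` leaves the attribute value alone and wraps only the two occurrences in the text content. Treating tags with a regex like this is a simplification; it would break on a literal `>` inside an attribute value, so working on the text nodes of a real DOM is probably the safer route.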
- Two or more words that form one entity
- Full stop: is it part of the sentence or part of the token you are looking for?
- Iterative replacement: when the user accepts the first occurrence, replace instantly or buffer?
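One possible way to approach these three points, sketched under assumptions rather than taken from any existing library: allow arbitrary whitespace between the words of a multi-word entry, flag a hit whose trailing full stop might also end the sentence (a crude heuristic: end of text, or whitespace followed by an uppercase letter), and buffer all hits with their offsets so that the user's decisions are applied in one pass from back to front. The names `MatchBuffer`, `Hit`, `Decision`, and the `state` attribute are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchBuffer {

    enum Decision { PENDING, ACCEPTED, IGNORED }

    /** One buffered occurrence; offsets refer to the original, unmodified text. */
    static final class Hit {
        final int start;
        final int end;
        final boolean mayEndSentence; // lets the GUI ask the user about the full stop
        Decision decision = Decision.PENDING;

        Hit(int start, int end, boolean mayEndSentence) {
            this.start = start;
            this.end = end;
            this.mayEndSentence = mayEndSentence;
        }
    }

    /** Turns an entry like "et al." into a regex tolerating any whitespace between words. */
    static String toRegex(String entry) {
        String[] words = entry.trim().split("\\s+");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) sb.append("\\s+");
            sb.append(Pattern.quote(words[i]));
        }
        return sb.toString();
    }

    /** Collects all occurrences of one entry without touching the text. */
    static List<Hit> scan(String text, String entry) {
        Pattern p = Pattern.compile(
                "(?<![\\p{L}\\p{N}])" + toRegex(entry) + "(?![\\p{L}\\p{N}])");
        // Heuristic only: end of text, or whitespace followed by an uppercase letter.
        Pattern sentenceEnd = Pattern.compile("\\s*$|\\s+\\p{Lu}");
        List<Hit> hits = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            boolean ambiguous = entry.endsWith(".")
                    && sentenceEnd.matcher(text).region(m.end(), text.length()).lookingAt();
            hits.add(new Hit(m.start(), m.end(), ambiguous));
        }
        return hits;
    }

    /** Applies decided hits back to front, so earlier offsets stay valid. */
    static String apply(String text, List<Hit> hits, String tag) {
        StringBuilder sb = new StringBuilder(text);
        for (int i = hits.size() - 1; i >= 0; i--) {
            Hit h = hits.get(i);
            if (h.decision == Decision.PENDING) continue; // undecided: leave as is
            String state = h.decision == Decision.ACCEPTED ? "accepted" : "ignored";
            String inner = sb.substring(h.start, h.end);
            sb.replace(h.start, h.end,
                    "<" + tag + " state=\"" + state + "\">" + inner + "</" + tag + ">");
        }
        return sb.toString();
    }
}
```

Buffering keeps the scan and the GUI decoupled: the text is scanned once, the Swing dialog only flips `decision` on each `Hit`, and nothing is rewritten until the user is done. Replacing instantly would instead require rescanning or shifting every later offset after each accepted match. For the full-stop question, `java.text.BreakIterator.getSentenceInstance()` might be a better basis than the uppercase heuristic used here.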
Any idea, library hint, Wikipedia article, or whatever else is helpful. I didn't find any related question that answers all of the aspects mentioned above.