views:

119

answers:

3

Hi.

I need to manipulate large strings in Java (deleting and adding the deleted chars again, moving chars around), but still want to remember the original position offsets. E.g. the word "computer" starts at offset 133 in the original text and is then moved to position 244, I still want the info that it was originally at position 133. The most ugly (and resource hungry) solution would be to store for every character its original position plus it's position change. There are surely better solutions, but also more complex ones. Are there any good text manipulation libraries that have a solution to my problem? I don't want to reinvent the wheel.

Regards, Kai

+1  A: 

The problem you are referring to is officially called the "String-to-string correction problem" which is related to Delta Encoding and the Levenshtein Distance. Here is code to compute the distance (it's in Java). All the differencing code is there, you simply have to add code that keeps track of the steps so you can reverse them or track them. Note: "moving" a word or character would be a delete/insert pair of the same word that occurs together.

This should work for both character, word, and substring moves.

colithium
Good point, but I don't think I need to calculate the Levenshtein distances as I am already aware of what is edited. For example I always get "informed" if something is deleted inside the document.
Zardoz
+2  A: 

How large are these strings ? Given the quantities of memory available today, brute force may be the way to go.

You talk about moving words, but storing character positions. Why not store word positions, and a history per instance of word. Note that you could be clever and make use of the flyweight pattern to save having multiple instances of these objects until you require. i.e. your 'string' object holds one 'computer' word object, but records that that word occurs at position 133, 245, 667 etc. (plus history as and when you need it)

Brian Agnew
Yep, that is what I meant by more complex solutions ;-) But you are right, if nobody comes up with a cool library for that task, I will go that way.
Zardoz
A: 

Before getting to stressed about efficiency, do a back of an envelope calculation. When you are okay with that and have code, you can double check with a profiler/stopwatch.

There is a ready made solution in the form of Swing text. It should be usable outside of a Swing context, although IIRC it tries to fire exceptions on the EDT (in the typical Swing thread-hostile way) - might want to check on that. There are Position objects that keep track of character positions within a Document even after insertions and deletions. If nothing else, it'll show how it can be done. Presumably the Apache Harmony implementation comes with a licence suitable for most normal people.

Tom Hawtin - tackline