ansaurus

Question

Textual Irregularities

Answer 1

+2 A:

If you are into Python, you might try difflib.

It's not an exact solution to your problem, but it might be helpful.
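As a rough sketch of one way difflib could be applied here: diff each raw line against a cleaned-up copy of itself and let `difflib.ndiff` point at what changed. The `normalize` helper and its clean-up rules are assumptions made for illustration, not something from the question.

```python
import difflib
import re

def normalize(line):
    """Hypothetical clean-up: trim, collapse whitespace, drop spaces before '.' or ','."""
    line = re.sub(r"\s+", " ", line.strip())
    return re.sub(r"\s+([.,])", r"\1", line)

def changed_lines(lines):
    """Return the ndiff output between the raw lines and their normalized copies."""
    cleaned = [normalize(line) for line in lines]
    return list(difflib.ndiff(lines, cleaned))

sample = [
    "1. Lazarus Long, Get the first shot off fast.",
    " 2. Hiro Protagonist  Greatest swordfighter[sic] in the world.",
    "3. Alice , Down the rabbit hole.",
    "5 . Orem, , Sink of power.",
]

# Lines prefixed "- " are raw lines that changed, "+ " are their cleaned forms,
# and "? " lines mark the characters that differ.
for entry in changed_lines(sample):
    print(entry.rstrip("\n"))
```

This only surfaces spacing problems; a missing or doubled delimiter survives normalization and still needs a separate check.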

yk4ever 2009-02-06 08:29:29

Answer 2

+2 A:

This problem changes radically depending on what sort of real-life irregularities you want to find or correct.

Here is your example updated with real text:

1. Lazarus Long, Get the first shot off fast.
2. Hiro Protagonist, Greatest swordfighter[sic] in the world.
3. Alice , Down the rabbit hole.
5. Orem, Sink of power.

In this example the errors could be fixed with a decent text editor's find and replace. Text editors and hex editors can work miracles if you get creative with wildcards. The problem stays simple as long as your delimiters (. or ,) are present. As you probably already know, as soon as one of them is missing the problem becomes much more complex.

Example of a hard problem:

1. Lazarus Long, Get the first shot off fast.
 2. Hiro Protagonist  Greatest swordfighter[sic] in the world.
3. Alice , Down the rabbit hole.
5 . Orem, , Sink of power.

I would probably attack this in a few steps. 1. Clean up extra spaces. 2. Gather key statistics such as the number of delimiters per line and the average number of words or characters per delimited column. Most names are one or two words; comment length is unknown or limited only by the input. 3. Find lines with a statistically improbable number of key features. 4. Try your best to correct them.
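A minimal sketch of steps 1 to 3, assuming the comma is the key delimiter and that "improbable" simply means "differs from the most common count". The function names are made up for this example, and step 4 (the actual correction) is left out.

```python
import re
from collections import Counter

def clean(line):
    """Step 1: trim the line and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", line.strip())

def flag_outliers(lines):
    """Steps 2 and 3: count commas per line, then flag lines that
    deviate from the most common count. Returns (index, line) pairs."""
    cleaned = [clean(line) for line in lines]
    counts = [line.count(",") for line in cleaned]
    typical, _ = Counter(counts).most_common(1)[0]
    return [(i, line)
            for i, (line, n) in enumerate(zip(cleaned, counts))
            if n != typical]

sample = [
    "1. Lazarus Long, Get the first shot off fast.",
    " 2. Hiro Protagonist  Greatest swordfighter[sic] in the world.",
    "3. Alice , Down the rabbit hole.",
    "5 . Orem, , Sink of power.",
]

for index, line in flag_outliers(sample):
    print(index, line)
```

On real data you would probably add more features, such as words per column, and use a looser threshold than exact equality with the mode.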

I understand that this does not directly solve your problem, but maybe one of these ideas can patch it over for a bit. It is possible that past wheelwrights never completed a design for this.

Phil 2009-02-06 08:54:49

Answer 3

A:

Sounds basically like you'd want to use a regex to describe an "ideal" line, then compare the rest of the lines against it.
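For illustration, here is what such an "ideal line" check might look like in Python. The pattern is an assumption about the intended format (number, period, name, comma, comment), based on the sample lines from the other answer.

```python
import re

# Assumed ideal shape: "<number>. <name words>, <comment>"
IDEAL = re.compile(r"\d+\. \w+(?: \w+)*, \S.*")

def nonconforming(lines):
    """Return the lines that do not fully match the ideal pattern."""
    return [line for line in lines if not IDEAL.fullmatch(line)]

easy = [
    "1. Lazarus Long, Get the first shot off fast.",
    "2. Hiro Protagonist, Greatest swordfighter[sic] in the world.",
    "3. Alice , Down the rabbit hole.",
    "5. Orem, Sink of power.",
]

print(nonconforming(easy))
```

A pattern like this checks each line in isolation, so it will not notice that entry 4 is missing from the numbering.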

Or you could write a more complicated program which would boil each line down into a Regex query, and then compare the queries to each other to see which ones are different.
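A simplified take on that second idea: instead of generating a full regex per line, reduce each line to a coarse signature and treat the most common signature as the norm. The `N`/`W` encoding below is an invented convention for this sketch.

```python
import re
from collections import Counter

def signature(line):
    """Reduce a line to its shape: word runs become 'W', digit runs become 'N'.
    Spaces and delimiters are kept, so stray ones show up in the signature."""
    # Words first, so the 'N' placeholder is never mistaken for a word.
    shape = re.sub(r"[^\s.,\d]+(?: [^\s.,\d]+)*", "W", line)
    return re.sub(r"\d+", "N", shape)

def odd_lines(lines):
    """Return lines whose signature differs from the most common one."""
    sigs = [signature(line) for line in lines]
    typical, _ = Counter(sigs).most_common(1)[0]
    return [line for line, sig in zip(lines, sigs) if sig != typical]

easy = [
    "1. Lazarus Long, Get the first shot off fast.",
    "2. Hiro Protagonist, Greatest swordfighter[sic] in the world.",
    "3. Alice , Down the rabbit hole.",
    "5. Orem, Sink of power.",
]

print(signature(easy[0]))
print(odd_lines(easy))
```

The majority vote only works while most lines are well formed; if irregular lines dominate, you would need to supply the expected signature yourself.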

2009-02-06 13:20:06
