tags:
views: 267
answers: 12

I have to clean some input from OCR which recognizes handwriting as gibberish. Any suggestions for a regex to clean out the random characters? Example:


Federal prosecutors on Monday charged a Miami man with the largest 
case of credit and debit card data theft ever in the United States, 
accusing the one-time government informant of swiping 130 million 
accounts on top of 40 million he stole previously.

, ':, Ie
':... 11'1
. '(.. ~!' ': f I I
. " .' I ~
I' ,11 l
I I I ~ \ :' ,! .~ , .. r, 1 , ~ I . I' , .' I ,.
, i
I ; J . I.' ,.\ ) ..
. : I
'I', I
.' '
r,"

Gonzalez is a former informant for the U.S. Secret Service who helped 
the agency hunt hackers, authorities say. The agency later found out that 
he had also been working with criminals and feeding them information 
on ongoing investigations, even warning off at least one individual, 
according to authorities.

eh....l
~.\O ::t
e;~~~
s: ~ ~. 0
qs c::; ~ g
o t/J (Ii .,
::3 (1l Il:l
~ cil~ 0 2:
t:lHj~(1l
. ~ ~a
0~ ~ S'
N ("b t/J :s
Ot/JIl:l"-<:!
v'g::!t:O
-....c......
VI (:ll <' 0
:= - ~
< (1l ::3
(1l ~ '
t/J VJ ~
Pl
.....
....
(II
A: 

Well, a group of symbols would match a bit of gibberish. Perhaps check against a dictionary for real words?

There seem to be a lot of line breaks where the gibberish is, so that may be an indicator too.

Russell
I did some research at Uni a few years ago around phrase extraction (you are kind of doing the opposite). There are lots of papers on it (e.g. http://portal.acm.org/citation.cfm?id=1097059); unfortunately, however, there is no single catch-all solution.
Russell
A: 

Interesting problem.

If this is representative, I suppose you could build a library of common words and delete any line which didn't match any of them.

Or perhaps you could count word characters versus punctuation characters and see if there is a reliable ratio cut-off, or simply a frequency of occurrence of certain characters that flags a line as gibberish.

Regardless, I think there will have to be some programming logic, not simply a single regular expression.
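
A minimal sketch of the common-word idea in Python (the word list, the one-match rule and the ocr_output.txt filename are made up; in practice you would load a real frequency list):

import re

# Hypothetical list of very common English words; load a real one in practice.
COMMON_WORDS = {"the", "of", "and", "to", "a", "in", "that", "is", "for", "on"}

def looks_like_text(line):
    words = re.findall(r"[a-zA-Z]+", line.lower())
    # Keep the line if at least one common word appears in it.
    return any(w in COMMON_WORDS for w in words)

ocr_lines = open("ocr_output.txt").read().splitlines()   # hypothetical input file
cleaned = [line for line in ocr_lines if looks_like_text(line)]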

Devin Ceartas
A: 

I guess that a regex would not help here. A regex basically matches deterministic input, i.e. it has a predefined set of patterns it will match, and gibberish is in most cases random. One way would be to invert the problem, i.e. match the relevant text instead of matching the gibberish.
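
As a rough sketch of that inversion (the "five or more word-like tokens" rule is arbitrary, and ocr_output.txt is a placeholder for the raw input):

import re

# Keep only runs of at least five consecutive word-like tokens; drop the rest.
ocr_text = open("ocr_output.txt").read()   # hypothetical input file
relevant = re.findall(r"(?:[A-Za-z][A-Za-z.,']*\s+){4,}[A-Za-z][A-Za-z.,']*", ocr_text)
print("\n\n".join(relevant))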

Rohin
+1  A: 

One of the simplest solutions (not involving regexps):

# python
import string

# count the punctuation characters in the line
number_of_punct = sum(1 for c in line if c in string.punctuation)

if number_of_punct > len(line) / 2:
    line_is_garbage()

well. Or rude regexpish s/[!,'"@#~$%^& ]{5,}//g
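
Roughly the same "rude" substitution as a Python function, for what it's worth (the character set and run length are as arbitrary as in the s/// version):

import re

def strip_runs(line):
    # Strip any run of five or more punctuation-ish characters or spaces.
    return re.sub(r"""[!,'"@#~$%^& ]{5,}""", "", line)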

maykeye
What about this line: ", i"?
Breton
Nothing. Remove it by hand later. Don't expect heuristics to remove all of the garbage. Proverb of the day: "Don't throw the baby out with the bath water".
maykeye
+1  A: 

Regex won't help here. I'd say if you have control over the recognition part then focus on better quality there: http://www.neurogy.com/ocrpreproc.html

You can also ask the user to help you by specifying the type of text you are working with. E.g. if it is a page from a book, then you would expect the majority of lines to be the same length and to consist mainly of letters, spaces and punctuation.
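
A rough sketch of that book-page assumption in Python (the length tolerance and the letters/spaces cutoff are guesses to be tuned):

import statistics

def keep_book_lines(lines, length_tolerance=0.5, letter_ratio=0.7):
    lengths = [len(l) for l in lines if l.strip()]
    typical = statistics.median(lengths)
    kept = []
    for line in lines:
        if not line.strip():
            continue
        close_enough = abs(len(line) - typical) <= typical * length_tolerance
        letters = sum(c.isalpha() or c.isspace() for c in line)
        if close_enough and letters / len(line) >= letter_ratio:
            kept.append(line)
    return kept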

DmitryK
A: 

I'd claim a regex like "any punctuation followed by anything except a space is spam".

So in .NET it's possibly something like

.Replace("\\p{1,}[a-zA-Z0-9]{1,}", "");

Then you'd consider "any word with two or more punctuation marks in a row":

.Replace(" \\p{2,} ", "");

Seems like a good start anyway.
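
If you want to try the same two passes outside .NET, a rough Python equivalent (Python's re has no \p{P}, so [^\w\s] stands in for "punctuation"; ocr_output.txt is a placeholder for the input):

import re

text = open("ocr_output.txt").read()   # hypothetical input file

# Pass 1: punctuation immediately followed by alphanumerics.
text = re.sub(r"[^\w\s]+[a-zA-Z0-9]+", "", text)

# Pass 2: standalone runs of two or more punctuation characters.
text = re.sub(r" [^\w\s]{2,} ", " ", text)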

Noon Silk
> I'd claim a regex like "any punctuation followed by anything except a space is spam". Not always, though. Some last names have hyphens, and not only last names (consider forget-me-not). Also, "quotes" start with punctuation.
maykeye
True; then don't include the double quote in that part of the regex. I don't think he's looking for a foolproof system; just something to do a 'first cut'.
Noon Silk
I disagree .... :-)
Stephen C
+2  A: 

A simple heuristic, similar to the anonymous answer:

listA = [0,1,2..9, a,b,c..z, A,B,C,..Z , ...] // alphanumerical symbols
listB = [!@$%^&...] // other symbols

Na = number_of_alphanumeric_symbols( line )
Nb = number_of_other_symbols( line )

if Na/Nb <= garbage_ratio then
  // garbage
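
A minimal Python version of this heuristic (garbage_ratio is arbitrary, division by zero is guarded when a line has no "other" symbols, and ocr_output.txt is a placeholder):

def is_garbage(line, garbage_ratio=1.0):
    na = sum(c.isalnum() for c in line)                        # alphanumeric symbols
    nb = sum(not (c.isalnum() or c.isspace()) for c in line)   # other symbols
    if nb == 0:
        return False
    return na / nb <= garbage_ratio

ocr_lines = open("ocr_output.txt").read().splitlines()   # hypothetical input file
cleaned = [line for line in ocr_lines if not is_garbage(line)]
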
Nick D
This assumes that the whole line is either garbage or it isn't, but based on the sample, this is a reasonable thing to assume.
Anton Geraschenko
yes, it is for filtering out whole lines. Removing garbage mixed with *normal* text won't be that simple :-)
Nick D
+1  A: 

No idea how well it would work, but I have considered this problem in the past, idly. I've on occasion played with a little programmatic device called a Markov chain.

Now the Wikipedia article probably won't make much sense until you see some of the other things a Markov chain is good for. One example of a Markov chain in action is this Greeking generator. Another example is the MegaHAL chatbot.

Greeking is gibberish that looks like words. Markov chains provide a way of randomly generating a sequence of letters, but weighting the random choices to emulate the frequency patterns of an examined corpus. So, for instance, given the letter "T", the letter "h" is more likely to show up next than any other letter. So you examine a corpus (say some newspapers, or blog postings) to produce a kind of fingerprint of the language you're targeting.

Now that you have that frequency table/fingerprint, you can examine your sample text and rate each letter according to the likelihood of it appearing. Then you can flag the letters below a particular likelihood threshold for removal. In other words, a surprise filter. Filter out surprises.

There's some leeway in how you generate your frequency tables. You're not limited to one letter following another. You can build a frequency table that predicts which letter will likely follow each digraph (group of two letters), or each trigraph, or quadgraph. You can work the other side, predicting likely and unlikely trigraphs to appear in certain positions, given some previous text.

It's kind of like a fuzzy regex. Rather than MATCH or NO MATCH, the whole text is scored on a sliding scale according to how similar it is to your reference text.
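
A bare-bones sketch of the bigram flavour of this in Python (reference_corpus.txt, ocr_output.txt, the unseen-pair penalty and the -4.0 threshold are all placeholders to be tuned):

import math
from collections import Counter

def bigram_model(corpus):
    # Log-probability of each character given the one before it.
    pair_counts = Counter(zip(corpus, corpus[1:]))
    prev_counts = Counter(corpus[:-1])
    return {(a, b): math.log(n / prev_counts[a]) for (a, b), n in pair_counts.items()}

def surprise(line, model, unseen_penalty=-10.0):
    # Average log-likelihood of the line; lower means more "surprising".
    pairs = list(zip(line, line[1:]))
    if not pairs:
        return unseen_penalty
    return sum(model.get(p, unseen_penalty) for p in pairs) / len(pairs)

model = bigram_model(open("reference_corpus.txt").read())        # known-good text
ocr_lines = open("ocr_output.txt").read().splitlines()           # text to clean
kept = [line for line in ocr_lines if surprise(line, model) > -4.0]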

Breton
A: 

I did a combo of eliminating lines that don't contain at least two 3-letter words or one 6-letter word.

([a-zA-Z]{3,}\s){2,}|([a-zA-Z]{6,})

http://www.regexpal.com/
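
Applied line by line in Python, that looks roughly like this (re.search, since the pattern only needs to occur somewhere in the line; ocr_output.txt is a placeholder):

import re

pattern = re.compile(r"([a-zA-Z]{3,}\s){2,}|[a-zA-Z]{6,}")
ocr_lines = open("ocr_output.txt").read().splitlines()   # hypothetical input file
kept = [line for line in ocr_lines if pattern.search(line)]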

JoshB
I'd add a dictionary comparison to make sure the words it finds are real and not just random letters.
faceless1_14
A: 

I like @Breton's answer. I'd suggest using his corpus approach together with a library of known 'bad scans', which might be easier to identify because 'junk' has more internal consistency than 'good text' when it comes from bad OCR scans (the number of distinct glyphs is lower, for example).

Jim P
A: 

Another good technique is to use a spell checker/dictionary and look up the 'words' after you've eliminated the unreadable stuff with a regex.
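
For instance, with the pyenchant package (assuming it is installed; the 50% real-word cutoff is just a guess):

import re
import enchant

d = enchant.Dict("en_US")

def mostly_real_words(line, threshold=0.5):
    words = re.findall(r"[a-zA-Z]{2,}", line)
    if not words:
        return False
    return sum(d.check(w) for w in words) / len(words) >= threshold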

Jay
+1  A: 

Here is a Perl implementation of the garbage_ratio heuristic:

#!/usr/bin/perl

use strict;
use warnings;

while ( defined( my $chunk = read_chunk(\*DATA) ) ) {
    next unless length $chunk;

    my @tokens = split ' ', $chunk;
    # what is a word?
    my @words  = grep {
        /^[A-Za-z]{2,}[.,]?$/
            or /^[0-9]+$/
            or /^(?:a|I)$/
            or /^(?:[A-Z][.])+$/
    } @tokens;

    # completely arbitrary threshold
    my $score = @words / @tokens;
    print $chunk, "\n" if $score > 0.5;
}

sub read_chunk {
    my ($fh) = @_;
    my ($chunk, $line);

    while ( my $line = <$fh> ) {
        if( $line =~ /\S/ ) {
            $chunk .= $line;
            last;
        }
    }

    while (1) {
        $line = <$fh>;
        last unless (defined $line) and ($line =~ /\S/);
        $chunk .= $line;
    }

    return $chunk;
}


__DATA__

Paste the text above after __DATA__ (not repeating it here to save space). Of course, the __DATA__ section is only there to make the script self-contained for posting; in real life, you would have code to open the file, etc.

Output:

Federal prosecutors on Monday charged a Miami man with the largest
case of credit and debit card data theft ever in the United States,
accusing the one-time government informant of swiping 130 million
accounts on top of 40 million he stole previously.

Gonzalez is a former informant for the U.S. Secret Service who helped
the agency hunt hackers, authorities say. The agency later found out that
he had also been working with criminals and feeding them information
on ongoing investigations, even warning off at least one individual,
according to authorities.
Sinan Ünür