I'm spending my weekend analyzing Campaign Finance Contribution records. Fun!

One of the annoying things I've noticed is that entity names are entered differently:

For example, I see stuff like this: 'llc', 'llc.', 'l l c', 'l.l.c', 'l. l. c.', 'llc,', etc.

I'm trying to catch all these variants.

So it would be something like:

"l([,\.\ ]*)l([,\.\ ]*)c([,\.\ ]*)"

Which isn't so bad... except there are about 40 entity suffixes that I can think of.

The best thing I can think of is programmatically building up this pattern, based on my list of suffixes.

I'm wondering if there's a better way to handle this within a single regex that is human readable/writable.

A: 

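The first two "l" parts are identical, so you can write that part once and repeat it with a quantifier: (the first "l" part here){2}. Use a group rather than square brackets, since square brackets would create a character class.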
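A minimal sketch of that idea, assuming Perl (the question does not name a language) and anchoring the pattern to the whole string, which the original pattern does not do. The names `$llc` and `is_llc` are made up for illustration:

```perl
use strict;
use warnings;

# "l" plus any run of separators, repeated twice, then "c" and trailing
# separators. (?: ... ) groups without capturing, so {2} applies to the
# whole "l plus separators" unit.
my $llc = qr/^(?:l[,.\s]*){2}c[,.\s]*\z/i;

# Hypothetical helper: true if $text is some spelling of "llc".
sub is_llc {
    my ($text) = @_;
    return $text =~ $llc ? 1 : 0;
}

for my $variant ('llc', 'llc.', 'l l c', 'l.l.c', 'l. l. c.', 'llc,', 'inc') {
    printf "%-10s %s\n", $variant, is_llc($variant) ? 'matches' : 'no match';
}
```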

Eli Grey
+2  A: 

Regexes (other than relatively simple ones) and readability rarely go hand-in-hand. Don't misunderstand me, I love them for the simplicity they usually bring, but they're not fit for all purposes.

If you want readability, just create an array of possible values and iterate through them, checking your field against them to see if there's a match.

Unless you're doing gene sequencing, the speed difference shouldn't matter. And it will be a lot easier to add a new one when you discover it. Adding an element to an array is substantially easier than reverse-engineering a regex.
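As a rough sketch of the array approach, assuming Perl and using only the spellings listed in the question (`is_known_variant` is a hypothetical helper name):

```perl
use strict;
use warnings;

# Every spelling seen so far; append new ones as they turn up.
my @llc_variants = ('llc', 'llc.', 'l l c', 'l.l.c', 'l. l. c.', 'llc,');

# Hypothetical helper: true if $field equals one of the known spellings,
# ignoring case.
sub is_known_variant {
    my ($field, @variants) = @_;
    for my $variant (@variants) {
        return 1 if lc($field) eq lc($variant);
    }
    return 0;
}

print is_known_variant('L.L.C', @llc_variants) ? "match\n" : "no match\n";
print is_known_variant('l-l-c', @llc_variants) ? "match\n" : "no match\n";
```

Adding a newly discovered spelling is then a one-line change to the array.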

paxdiablo
You could create an array of regexes, one for each possible suffix, and cycle through them. That would be relatively readable, but still a bit difficult to maintain.
Chris Lutz
A: 

You can squish periods and whitespace first, before matching. For instance, in Perl:

while (<>) {
  chomp;
  my $sq = lc $_;    # temporary lowercased copy; the original stays in $_
  $sq =~ s/[.\s]//g; # squish away "." and whitespace in the temporary copy
  $_ = 'L.L.C.' if $sq =~ /^llc$/; # if the squished copy matches, save the canonical version
  $_ = 'IBM' if $sq =~ /^ibm/;     # a different match
  print "$_\n";
}
Alex Brown
If you're going to match against text, you can just use the `/i` modifier to ignore case. It's easier than remembering to `lc()` everything IMHO.
Chris Lutz
+2  A: 

You could just strip out excess crap. Using Perl:

my $suffix = "l. lc.."; # the worst case imaginable!

$suffix =~ s/[.\s]//g;
# no matter what variation $suffix was, it's now just "llc"

Obviously this may maul your input if you use it on the full company name, but getting too in-depth with how to do that would require knowing what language we're working with. A possible regex solution is to copy the company name and strip out a few common words and any words with more than (about) 4 characters:

my $suffix = $full_name;

$suffix =~ s/\w{5,}//g; # strip words of more than 4 characters
$suffix =~ s/\b(?:a|the|an|of)\b//ig; # strip a few common words (whole words only)
# now we can mangle $suffix all we want
# and be relatively sure of what we're doing

It's not perfect, but it should be fairly effective, and more readable than using a single "monster regex" to try to match all of them. As a rule, don't use a monster regex to match all cases; use a series of specialized regexes to narrow many cases down to a few. It will be easier to understand.
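Chained together, the small steps might look like the sketch below. The helper name `candidate_suffix` and the sample company name are invented for illustration, and the comma is added to the stripped characters because the question lists 'llc,' as a variant:

```perl
use strict;
use warnings;

# Hypothetical helper: boil a full company name down to a squished,
# lowercased candidate suffix using several small regexes.
sub candidate_suffix {
    my ($full_name) = @_;
    my $suffix = lc $full_name;
    $suffix =~ s/\w{5,}//g;                 # drop words longer than 4 characters
    $suffix =~ s/\b(?:a|the|an|of)\b//g;    # drop a few common short words
    $suffix =~ s/[.,\s]//g;                 # squish separators
    return $suffix;
}

print candidate_suffix('Friends of Smith, L. L. C.'), "\n";
```

Short words in the name itself survive these steps (a four-letter company name would be glued onto the suffix), so this is a first pass rather than a complete parser.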

Chris Lutz
A: 

Don't use regexes, instead build up a map of all discovered (so far) entries and their 'canonical' (favourite) versions.

Also build a tool that discovers possible new suffix variants: group the entries that share a common prefix of a certain number of characters and print them on the screen, so you can add new rules.
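One possible shape for this, sketched in Perl. The map contents and the helper names `canonical_suffix` and `unknown_by_prefix` are assumptions for illustration, not a fixed design:

```perl
use strict;
use warnings;

# Hand-maintained map: variant as entered (lowercased) => canonical form.
my %canonical = (
    'llc'    => 'LLC',
    'l.l.c.' => 'LLC',
    'l l c'  => 'LLC',
    'inc'    => 'INC',
    'inc.'   => 'INC',
);

# Hypothetical helper: look up a suffix; returns undef when it is unknown.
sub canonical_suffix {
    my ($suffix, $map) = @_;
    return $map->{ lc($suffix) };
}

# Hypothetical discovery helper: collect the unknown suffixes, grouped by
# their first $len characters, so a person can review them and extend the map.
sub unknown_by_prefix {
    my ($suffixes, $map, $len) = @_;
    my %seen;
    for my $suffix (@$suffixes) {
        next if defined canonical_suffix($suffix, $map);
        $seen{ substr(lc($suffix), 0, $len) }{ lc($suffix) }++;
    }
    return \%seen;
}

my $report = unknown_by_prefix(['LLC', 'l.l.c', 'l. l. c.', 'inc,', 'Inc.'], \%canonical, 1);
for my $prefix (sort keys %$report) {
    print "$prefix: ", join(' | ', sort keys %{ $report->{$prefix} }), "\n";
}
```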

Alex Brown
This doesn't make sense. How would you build this map without a regex? By hand? That sounds fun to maintain. I can understand an aversion to cryptic regexes, but this is exactly the nightmare regexes were created to avoid.
Chris Lutz
I agree with Chris. A regex is a function that describes such a mapping in less space.
James Thompson
Yes, by hand. Humans will only enter certain variants in any case. There's also a limited set of cases. What he is proposing is a transformation or interpretation of user input; manually checking this transformation is more honest than trusting to luck. You can optimise it to regexps later.
Alex Brown
A: 

In Perl you can build up regular expressions inside your program using strings. Here's some example code:

#!/usr/bin/perl

use strict;
use warnings;

my @strings = (
    "l.l.c",
    "llc",
    "LLC",
    "lLc",
    "l,l,c",
    "L . L C ",
    "l  W c"
);

my @seps = ('.',',','\s');
my $sep_regex = '[' . join('', @seps) . ']*';
my $regex_def = join '', (
    '[lL]',
    $sep_regex,
    '[lL]',
    $sep_regex,
    '[cC]'
);

print "definition: $regex_def\n";

foreach my $str (@strings) {
    if ( $str =~ /$regex_def/ ) {
        print "$str matches\n";
    } else {
        print "$str doesn't match\n";
    }
}

This regular expression could also be simplified by using case-insensitive matching (that is, $str =~ /$regex_def/i), which lets you write plain l and c instead of [lL] and [cC]. If you run this against your own strings, you can easily spot the cases that your regular expression fails to match. Building up the regular expression this way means you define your separator symbols only once, and I think that people are likely to use the same separators for a wide variety of abbreviations (like IRS, I.R.S, irs, etc.).
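The same construction can be extended to a whole list of suffixes. The sketch below is one way to do it; the helper names `suffix_pattern` and `identify_suffix` and the four sample suffixes are placeholders, and the patterns are anchored to the whole string, unlike the program above:

```perl
use strict;
use warnings;

my $sep = '[.,\s]*';    # separators, defined once

# Hypothetical helper: turn a plain suffix such as "llc" into a pattern
# that tolerates separators between and after its letters.
sub suffix_pattern {
    my ($suffix) = @_;
    my $body = join $sep, map { quotemeta($_) } split //, $suffix;
    return qr/^${body}${sep}\z/i;
}

# Illustrative list; the real one would hold every known suffix.
my %pattern_for = map { $_ => suffix_pattern($_) } qw(llc inc ltd corp);

# Hypothetical helper: return the plain suffix that $text spells, or undef.
sub identify_suffix {
    my ($text) = @_;
    for my $suffix (sort keys %pattern_for) {
        return $suffix if $text =~ $pattern_for{$suffix};
    }
    return;
}

for my $text ('L . L C ', 'l,l,c', 'Inc.', 'l  W c') {
    my $found = identify_suffix($text);
    print "'$text' -> ", defined $found ? $found : 'no match', "\n";
}
```

The list of suffixes stays human readable, and the separator handling lives in exactly one place.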

You also might think about looking into approximate string matching algorithms, which are popular in a large number of areas. The idea behind these is that you define a scoring system for comparing strings, and then you can measure how similar input strings are to your canonical string, so that you can recognize that "LLC" and "lLc" are very similar strings.
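As one concrete scoring system, the sketch below uses Levenshtein (edit) distance, which is only one of several possible choices. The helper `closest_suffix` and its threshold argument are assumptions for illustration:

```perl
use strict;
use warnings;
use List::Util qw(min);

# Levenshtein (edit) distance: the number of single-character insertions,
# deletions, and substitutions needed to turn $s into $t.
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @curr = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @curr, min(
                $prev[$j] + 1,          # deletion
                $curr[$j - 1] + 1,      # insertion
                $prev[$j - 1] + $cost,  # substitution (or match)
            );
        }
        @prev = @curr;
    }
    return $prev[-1];
}

# Hypothetical helper: the closest canonical suffix within $max edits,
# or nothing when every candidate is further away than that.
sub closest_suffix {
    my ($input, $max, @canonical) = @_;
    my ($best, $best_dist);
    for my $candidate (@canonical) {
        my $dist = edit_distance(lc($input), lc($candidate));
        ($best, $best_dist) = ($candidate, $dist)
            if !defined($best_dist) || $dist < $best_dist;
    }
    return $best if defined($best_dist) && $best_dist <= $max;
    return;
}

print closest_suffix('l.l.c', 2, qw(llc inc ltd corp)), "\n";
```

The threshold needs tuning by hand: short suffixes such as "llc" and "llp" are only one edit apart, so a loose limit will merge entities that should stay distinct.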

Alternatively, as other people have suggested, you could write an input sanitizer that removes unwanted characters like whitespace, commas, and periods. In the context of the program above, you could do this:

# $sep_regex and @strings are already defined in the program above
foreach my $str (@strings) {
    my $copy = $str;
    $copy =~ s/$sep_regex//g;
    $copy = lc $copy;
    print "$str -> $copy\n";
}

If you have control of how the data is entered originally, you could use such a sanitizer to validate input from the users and other programs, which will make your analysis much easier.

James Thompson
It seems rather silly to consider using `[lL]` when you could just as easily use `/i` at the end to make it all better. Most regex flavors have similar options to make case-insensitive matching easy.
Chris Lutz
@Chris - I was trying to balance portability with conciseness, but you're definitely right. I made a note of this in the answer.
James Thompson