Background

Looking to automate creating Domains in JasperServer. Domains are a "view" of data for creating ad hoc reports. The names of the columns must be presented to the user in a human-readable fashion.

Problem

There are over 2,000 possible pieces of data that the organization could theoretically want to include on a report. The data are sourced from columns with non-human-friendly names such as:

payperiodmatchcode labordistributioncodedesc dependentrelationship
actionendoption actionendoptiondesc addresstype addresstypedesc
historytype psaddresstype rolename bankaccountstatus
bankaccountstatusdesc bankaccounttype bankaccounttypedesc
beneficiaryamount beneficiaryclass beneficiarypercent benefitsubclass
beneficiaryclass beneficiaryclassdesc benefitactioncode
benefitactioncodedesc benefitagecontrol benefitagecontroldesc
ageconrolagelimit ageconrolnoticeperiod

Question

How would you automatically change such names to:

  • pay period match code
  • labor distribution code desc
  • dependent relationship

Ideas

  • Use Google's "Did you mean" engine; however, I think this violates their TOS:

    lynx -dump «url» | grep "Did you mean" | awk ...

Languages

Any language is fine, but a language well-suited to text parsing, such as Perl, would probably work best. (The column names are English-only.)

Unnecessary Perfection

The goal is not 100% perfection in breaking words apart; the following outcome is acceptable:

  • enrollmenteffectivedate -> Enrollment Effective Date
  • enrollmentenddate -> Enroll Men Tend Date
  • enrollmentrequirementset -> Enrollment Requirement Set

No matter what, a human will need to double-check the results and correct many. Whittling a set of 2,000 results down to 600 edits would be a dramatic time savings. To fixate on some cases having multiple possibilities (e.g., therapistname) is to miss the point altogether.

A: 

Given that some words could be substrings of others, especially with multiple words smashed together, I think simple solutions like regexes are out. I'd go with a full-on parser; my experience is with ANTLR. If you want to stick with Perl, I've had good luck using ANTLR parsers generated as Java through Inline::Java.
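
For illustration, a minimal sketch of the Inline::Java bridge, with a trivial regex stub standing in for the ANTLR-generated lexer (the `Splitter` class and its `split` method are placeholders, not real grammar output):

#!/usr/bin/perl

use strict; use warnings;

# Compile a Java class inline; Inline::Java binds it into Perl.
# A real solution would drop the ANTLR-generated lexer in here.
use Inline Java => <<'JAVA';
class Splitter {
    public Splitter() { }
    public String split(String id) {
        // stub: a generated lexer would tokenize the identifier
        return id.replaceAll("desc$", " desc");
    }
}
JAVA

my $splitter = Splitter->new;
print $splitter->split('addresstypedesc'), "\n";   # addresstype desc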

bemace
@bemace: Learning ANTLR, writing the grammar, and then running it would take about 8 hours. Thanks, though!
Dave Jarvis
No denying that. I put it off for much the same reason, but for some things there's just no other way to get there. Sounds like in this case you're in the gray area where a parser would be better, but you can still get by without one.
bemace
+1  A: 

The following shell script leverages Google:

#!/bin/bash

# one identifier per line in columns.txt
for i in $(cat columns.txt); do
  echo -n "$i : "
  # ask Google and scrape its "Did you mean" suggestion
  lynx -dump "http://www.google.com/search?q=$i" | \
    grep "Did you mean to search for" | \
    awk '{ for( i = 7; i <= NF; i++ ) printf( "%s ", $i ) }'
  echo
  sleep 1   # throttle requests
done
Dave Jarvis
+10  A: 

Sometimes, brute-forcing is acceptable:

#!/usr/bin/perl

use strict; use warnings;
use File::Slurp;

my $dict_file = '/usr/share/dict/words';

my @identifiers = qw(
    payperiodmatchcode labordistributioncodedesc dependentrelationship
    actionendoption actionendoptiondesc addresstype addresstypedesc
    historytype psaddresstype rolename bankaccountstatus
    bankaccountstatusdesc bankaccounttype bankaccounttypedesc
    beneficiaryamount beneficiaryclass beneficiarypercent benefitsubclass
    beneficiaryclass beneficiaryclassdesc benefitactioncode
    benefitactioncodedesc benefitagecontrol benefitagecontroldesc
    ageconrolagelimit ageconrolnoticeperiod
);

my @mydict = qw( desc );

# Build one giant alternation out of the dictionary, sorted longest
# first so longer words win, skipping words shorter than three chars.
my $pat = join('|',
    map quotemeta,
    sort { length $b <=> length $a || $a cmp $b }
    grep { 2 < length }
    (@mydict, map { chomp; $_ } read_file $dict_file)
);

my $re = qr/$pat/;

for my $identifier ( @identifiers ) {
    my @stack;
    print "$identifier : ";
    # repeatedly strip the longest matching word off the end
    while ( $identifier =~ s/($re)\z// ) {
        unshift @stack, $1;
    }
    # mark suspicious cases where part of the string was not consumed
    unshift @stack, '*', $identifier if length $identifier;
    print "@stack\n";
}

Output:

payperiodmatchcode : pay period match code
labordistributioncodedesc : labor distribution code desc
dependentrelationship : dependent relationship
actionendoption : action end option
actionendoptiondesc : action end option desc
addresstype : address type
addresstypedesc : address type desc
historytype : history type
psaddresstype : * ps address type
rolename : role name
bankaccountstatus : bank account status
bankaccountstatusdesc : bank account status desc
bankaccounttype : bank account type
bankaccounttypedesc : bank account type desc
beneficiaryamount : beneficiary amount
beneficiaryclass : beneficiary class
beneficiarypercent : beneficiary percent
benefitsubclass : benefit subclass
beneficiaryclass : beneficiary class
beneficiaryclassdesc : beneficiary class desc
benefitactioncode : benefit action code
benefitactioncodedesc : benefit action code desc
benefitagecontrol : benefit age control
benefitagecontroldesc : benefit age control desc
ageconrolagelimit : * ageconrol age limit
ageconrolnoticeperiod : * ageconrol notice period

See also A Spellchecker Used to Be a Major Feat of Software Engineering.

Sinan Ünür
Note: Requires `libfile-slurp-perl`.
Dave Jarvis
@Sinan, @Dave: Corner case: `benefitactioncodedesc : taction code desc`
Axeman
@Axeman I do expect there to be corner cases. I get `benefitactioncodedesc : benefit action code desc` because my `words` file does not include **taction** ;-) I guess the right thing to do is to handle the case where some of the original string is not consumed. For example, in `ageconrolagelimit : age limit`, I guess that should have been **control** rather than **conrol**.
Sinan Ünür
@Sinan: Conrol should be control. For the record, not my database design or implementation. ;-)
Dave Jarvis
If the list file is one identifier per line, just do `chomp(my @identifiers = <> );`.
Sinan Ünür
@Dave A more refined solution, allowing you to (1) read the identifiers from `STDIN` or an external file, (2) specify a custom dictionary file, and (3) specify an output file or output to `STDOUT` is now available on my blog: http://blog.nu42.com/2010/10/sometimes-brute-force-solution-is.html
Sinan Ünür
@Sinan: I didn't know what it was, but it was in our words file. http://dictionary.reference.com/browse/taction I'm thinking we probably have more medical-related words in that file.
Axeman
Instead of rummaging through the file, you might want to search for the words right after the space first. It might be more intensive, but doing a recursive iteration will be more correct and hopefully avoid cases like your `taction`.
vol7ron
@vol7ron: An interesting idea. Sinan's solution does not solve, for example, `benefitcalcrulecodedesc` and `bankaccountstatusdesc`.
Dave Jarvis
@Sinan: Thank you for the refined version. I hope other people find it as useful as I do.
Dave Jarvis
@vol7ron I am sorry, but I cannot figure out what you mean by "search for the words right after the space first". @Dave It does indeed solve `benefitcalcrulecodedesc` and `bankaccountstatusdesc`, but you have to tell the script that `calc` and `desc` are words.
Sinan Ünür
@Sinan, sort of like a forward lookahead for the next word: basically, don't go on to the next set until all words are found. For `benefitcalcruledecidedesc`, start with words that begin with `b`, largest first. It finds `benefit`, but before accepting it, look for the next word in `calcruledecidedesc`. If at any point the next word does not exist, you have to go back to the previously selected word; e.g., after `ruled` is selected there is no word possibility that begins with `eci..`, so go back and select `rule`, then you can go forward and check possibilities with `d` (`decide`).
vol7ron
@vol7ron Feel free to code that solution. Let's assume there is no string `benefitaccounteditembed` or something like that in the source data. Is that `benefit`, `accounted`, `item`, `bed` or `benefit`, `account`, `edit`, `embed`? In any case, I cannot see a way to make sure *a priori* there won't be any problem inputs. However, this method will get you most of the way there with the rest being handled by Dave's intern.
Sinan Ünür
@Sinan, yes, I'm not downing the above; just saying that recursively iterating by section would be more accurate than going by the dictionary. As for your `benefitaccounteditembed` example (great example, btw), I would say the second would come first, unless you chained through all possibilities (very CPU intensive), and you would still need a mechanism that weighted the values/words. If you didn't want to weight the words individually, you could set up a scheme that checked the minimum length and chose the combos that had the largest minimum (or something silly like that).
vol7ron
@Sinan, as for coding it, I'd rather leave that to you :) On second thought: store all possibilities and let the user decide.
vol7ron
@Sinan: Was thinking about the problem a bit more (there are only 600 lines that need attention: a far cry from the original 2,000, so thank you). `benefitaccounteditembed` with a dictionary without tense conjugations yields "benefit account edit embed" and "benefit account ed it embed". The most likely answer is the phrase with the longest words. At this point it is academic, though. I combined your results with the Google "Did You Mean" technique to reduce the set to a manageable 600.
Dave Jarvis
@Dave Thanks for the feedback. I would have hoped for a better than 70% success rate, but, oh well. I saw your code golf question (and this really is not suitable for golfing), but I would recommend posing this as a general algorithm question. Maybe there is a more suitable dictionary than `words` for this purpose. My main purpose in writing this script was to see how well creating one giant regex out of the entire contents of `words` performed (i.e. using the least possible developer time).
Sinan Ünür
@Sinan: You're welcome. The success rate would be better when combined with a secondary routine that splits otherwise undecipherable column names by scanning forward through the string (and using the longest matching words).
Dave Jarvis
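
A minimal Perl sketch of the forward-scanning routine discussed in these comments: it tries longer prefixes first, backtracks when the remainder cannot be split, and ranks competing splits by vol7ron's largest-minimum-word heuristic. The dictionary path and the `desc`/`calc` seed words are assumptions:

#!/usr/bin/perl

use strict; use warnings;

# Seed schema abbreviations the system dictionary will not contain.
my %words = map { $_ => 1 } qw( desc calc );

open my $dict, '<', '/usr/share/dict/words' or die "words: $!";
while ( my $w = <$dict> ) {
    chomp $w;
    $words{ lc $w } = 1 if length $w >= 3;
}
close $dict;

# Return a reference to every complete split of $s into dictionary
# words, trying longer prefixes first and backtracking on dead ends.
sub splits {
    my ($s) = @_;
    return [ [] ] if $s eq '';
    my @found;
    for my $n ( reverse 3 .. length $s ) {
        my $prefix = substr $s, 0, $n;
        next unless $words{$prefix};
        push @found, map { [ $prefix, @$_ ] } @{ splits( substr $s, $n ) };
    }
    return \@found;
}

# Smallest word length in a split, used to rank alternatives.
sub min_len {
    my ($split) = @_;
    my ($min) = sort { $a <=> $b } map { length } @$split;
    return $min;
}

for my $id (qw( benefitactioncodedesc ageconrolagelimit )) {
    my @all = @{ splits($id) };
    if ( !@all ) {
        print "$id : *** no split\n";
        next;
    }
    # Prefer the split whose shortest word is longest.
    my ($best) = sort { min_len($b) <=> min_len($a) } @all;
    print "$id : @$best\n";
}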
A: 

I reduced your list to 32 atomic terms that I was concerned about and put them in longest-first arrangement in a regex:

use strict;
use warnings;

my $qr 
    = qr/ \G # right after last match
          ( distribution 
          | relationship 
          | beneficiary 
          | dependent 
          | subclass 
          | account
          | benefit 
          | address 
          | control 
          | history
          | percent 
          | action 
          | amount
          | conrol 
          | option 
          | period 
          | status 
          | class 
          | labor 
          | limit 
          | match 
          | notice
          | bank
          | code 
          | desc 
          | name 
          | role 
          | type 
          | age 
          | end 
          | pay
          | ps 
          )
    /x;

while ( <DATA> ) { 
    chomp;
    print;
    print ' -> ', join( ' ', m/$qr/g ), "\n";
}

__DATA__
payperiodmatchcode
labordistributioncodedesc
dependentrelationship
actionendoption
actionendoptiondesc
addresstype
addresstypedesc
historytype
psaddresstype
rolename
bankaccountstatus
bankaccountstatusdesc
bankaccounttype
bankaccounttypedesc
beneficiaryamount
beneficiaryclass
beneficiarypercent
benefitsubclass
beneficiaryclass
beneficiaryclassdesc
benefitactioncode
benefitactioncodedesc
benefitagecontrol
benefitagecontroldesc
ageconrolagelimit
ageconrolnoticeperiod
Axeman
@Axeman: Thanks. You only have a sample of the list, though. The actual list has 2,000 items.
Dave Jarvis
@Dave: If you have a data dictionary, one that allows "desc" to invariably stand for "description", then it is no problem. Since you don't give all 2,000 items (and thank you), we can't possibly tell whether your complete set has "false" matches to longer words in Sinan's dictionary example that you will have to tune out; a list of atomic terms is one way to control the process. You also can't tell how many terms you'll have to add, like `'desc'` and `'ps'`, so you'll end up maintaining the list anyway. You can add baroque cases to your code, or you can maintain a data dictionary.
Axeman
@Dave, in addition, I even added 'conrol' to the list, in case it meant anything.
Axeman
@Axeman: I was thinking about doing a simple search/replace on the major items (like "desc"). There is an intern who can do the rest of the work by hand. Like I said, just looking for a 90% solution to chop 11 hours down to 1 hour.
Dave Jarvis
+1  A: 

Two things occur to me:

  • this just isn't a task you can confidently attack programmatically, because English words don't work like that: they're often made of other words. Is a given string "reportage" or "report age"? "Timepiece" or "time piece"?
  • One way to attack the problem would be to use anag, which finds anagrams. After all, "time piece" is an anagram of "timepiece"; now you just have to weed out the false positives.
AmbroseChapel
@Ambrose: Not looking for perfection. Adding 6,000+ spaces to split words would take 11 hours. Reducing that to 1 hour is solving the problem.
Dave Jarvis
A: 

Here is a Lua program that tries longest matches from a dictionary:

-- load the system word list into a set
local W={}
for w in io.lines("/usr/share/dict/words") do
    W[w]=true
end

-- greedily take the longest dictionary prefix (at least 3 chars)
-- and recurse on the remainder; if the greedy choice leads to a
-- dead end, shorter prefixes are not retried
function split(s)
    for n=#s,3,-1 do
        local w=s:sub(1,n)
        if W[w] then return w,split(s:sub(n+1)) end
    end
end

-- reads identifiers from standard input, one per line
for s in io.lines() do
    print(s,"-->",split(s))
end
lhf