Background

Looking to automate creating Domains in JasperServer. Domains are a "view" of data for creating ad hoc reports. The names of the columns must be presented to the user in a human-readable fashion.

Problem

There are over 2,000 possible pieces of data that the organization could theoretically want to include on a report. The data are sourced from columns with non-human-friendly names such as:

payperiodmatchcode labordistributioncodedesc dependentrelationship
actionendoption actionendoptiondesc addresstype addresstypedesc
historytype psaddresstype rolename bankaccountstatus
bankaccountstatusdesc bankaccounttype bankaccounttypedesc
beneficiaryamount beneficiaryclass beneficiarypercent benefitsubclass
beneficiaryclass beneficiaryclassdesc benefitactioncode
benefitactioncodedesc benefitagecontrol benefitagecontroldesc
ageconrolagelimit ageconrolnoticeperiod

Question

How would you automatically change such names to:

  • pay period match code
  • labor distribution code desc
  • dependent relationship

Ideas

  • Use Google's "Did you mean" engine; however, I think this violates their TOS:

    lynx -dump «url» | grep "Did you mean" | awk ...

Languages

Any language is fine, but a language well-suited to text parsing, such as Perl, would probably work best. (The column names are English-only.)

Unnecessary Perfection

The goal is not 100% perfection in breaking words apart; the following outcome is acceptable:

  • enrollmenteffectivedate -> Enrollment Effective Date
  • enrollmentenddate -> Enroll Men Tend Date
  • enrollmentrequirementset -> Enrollment Requirement Set

No matter what, a human will need to double-check the results and correct many. Whittling a set of 2,000 results down to 600 edits would be a dramatic time savings. To fixate on some cases having multiple possibilities (e.g., therapistname) is to miss the point altogether.

A: 

Given that some words could be substrings of others, especially with multiple words smashed together, I think simple solutions like regexes are out. I'd go with a full-on parser; my experience is with ANTLR. If you want to stick with Perl, I've had good luck using ANTLR parsers generated as Java through Inline::Java.
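
For illustration, a minimal sketch of the Inline::Java bridge, with a trivial regex stub standing in for the ANTLR-generated lexer (the `Splitter` class and its `split` method are placeholders, not real grammar output):

#!/usr/bin/perl

use strict; use warnings;

# Compile a Java class inline; Inline::Java binds it into Perl.
# A real solution would drop the ANTLR-generated lexer in here.
use Inline Java => <<'JAVA';
class Splitter {
    public Splitter() { }
    public String split(String id) {
        // stub: a generated lexer would tokenize the identifier
        return id.replaceAll("desc$", " desc");
    }
}
JAVA

my $splitter = Splitter->new;
print $splitter->split('addresstypedesc'), "\n";   # addresstype desc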

bemace
@bemace: Learning ANTLR, writing the grammar, and then running it would take about 8 hours. Thanks, though!
Dave Jarvis
No denying that. I put it off for much the same reason, but for some things there's just no other way to get there. Sounds like in this case you're in the gray area where a parser would be better, but you can still get by without one.
bemace
+1  A: 

The following shell script leverages Google:

#!/bin/bash

# one identifier per line in columns.txt
for i in $(cat columns.txt); do
  echo -n "$i : "
  # ask Google and scrape its "Did you mean" suggestion
  lynx -dump "http://www.google.com/search?q=$i" | \
    grep "Did you mean to search for" | \
    awk '{ for( i = 7; i <= NF; i++ ) printf( "%s ", $i ) }'
  echo
  sleep 1   # throttle requests
done
Dave Jarvis
+10  A: 

Sometimes, brute-forcing is acceptable:

#!/usr/bin/perl

use strict; use warnings;
use File::Slurp;

my $dict_file = '/usr/share/dict/words';

my @identifiers = qw(
    payperiodmatchcode labordistributioncodedesc dependentrelationship
    actionendoption actionendoptiondesc addresstype addresstypedesc
    historytype psaddresstype rolename bankaccountstatus
    bankaccountstatusdesc bankaccounttype bankaccounttypedesc
    beneficiaryamount beneficiaryclass beneficiarypercent benefitsubclass
    beneficiaryclass beneficiaryclassdesc benefitactioncode
    benefitactioncodedesc benefitagecontrol benefitagecontroldesc
    ageconrolagelimit ageconrolnoticeperiod
);

my @mydict = qw( desc );

# Build one giant alternation out of the dictionary, sorted longest
# first so longer words win, skipping words shorter than three chars.
my $pat = join('|',
    map quotemeta,
    sort { length $b <=> length $a || $a cmp $b }
    grep { 2 < length }
    (@mydict, map { chomp; $_ } read_file $dict_file)
);

my $re = qr/$pat/;

for my $identifier ( @identifiers ) {
    my @stack;
    print "$identifier : ";
    # repeatedly strip the longest matching word off the end
    while ( $identifier =~ s/($re)\z// ) {
        unshift @stack, $1;
    }
    # mark suspicious cases where part of the string was not consumed
    unshift @stack, '*', $identifier if length $identifier;
    print "@stack\n";
}

Output:

payperiodmatchcode : pay period match code
labordistributioncodedesc : labor distribution code desc
dependentrelationship : dependent relationship
actionendoption : action end option
actionendoptiondesc : action end option desc
addresstype : address type
addresstypedesc : address type desc
historytype : history type
psaddresstype : * ps address type
rolename : role name
bankaccountstatus : bank account status
bankaccountstatusdesc : bank account status desc
bankaccounttype : bank account type
bankaccounttypedesc : bank account type desc
beneficiaryamount : beneficiary amount
beneficiaryclass : beneficiary class
beneficiarypercent : beneficiary percent
benefitsubclass : benefit subclass
beneficiaryclass : beneficiary class
beneficiaryclassdesc : beneficiary class desc
benefitactioncode : benefit action code
benefitactioncodedesc : benefit action code desc
benefitagecontrol : benefit age control
benefitagecontroldesc : benefit age control desc
ageconrolagelimit : * ageconrol age limit
ageconrolnoticeperiod : * ageconrol notice period

See also A Spellchecker Used to Be a Major Feat of Software Engineering.

Sinan Ünür
Note: Requires `libfile-slurp-perl`.
Dave Jarvis
@Sinan, @Dave: Corner case: `benefitactioncodedesc : taction code desc`
Axeman
@Axeman I do expect there to be corner cases. I get `benefitactioncodedesc : benefit action code desc` because my `words` file does not include **taction** ;-) I guess the right thing to do is to handle the case where some of the original string is not consumed. For example, in `ageconrolagelimit : age limit`, I guess that should have been **control** rather than **conrol**.
Sinan Ünür
@Sinan: Conrol should be control. For the record, not my database design or implementation. ;-)
Dave Jarvis
If the list file is one identifier per line, just do `chomp(my @identifiers = <> );`.
Sinan Ünür
@Dave A more refined solution, allowing you to (1) read the identifiers from `STDIN` or an external file, (2) specify a custom dictionary file, and (3) specify an output file or output to `STDOUT` is now available on my blog: http://blog.nu42.com/2010/10/sometimes-brute-force-solution-is.html
Sinan Ünür
@Sinan: I didn't know what it was, but it was in our words file. http://dictionary.reference.com/browse/taction I'm thinking we probably have more medical-related words in that file.
Axeman
Instead of rummaging through the file, you might want to search for the words right after the space first. It might be more intensive, but doing a recursive iteration will be more correct and hopefully avoid cases like your `taction`.
vol7ron
@vol7ron: An interesting idea. Sinan's solution does not solve, for example, `benefitcalcrulecodedesc` and `bankaccountstatusdesc`.
Dave Jarvis
@Sinan: Thank you for the refined version. I hope other people find it as useful as I do.
Dave Jarvis
@vol7ron I am sorry, but I cannot figure out what you mean by "search for the words right after the space first". @Dave It does indeed solve `benefitcalcrulecodedesc` and `bankaccountstatusdesc`, but you have to tell the script that `calc` and `desc` are words.
Sinan Ünür
@Sinan, sort of like a forward lookahead for the next word: basically, don't go on to the next set until all words are found. For `benefitcalcruledecidedesc`, start with words that begin with `b`, largest first. It finds `benefit`, but before accepting it, look for the next word in `calcruledecidedesc`. If at any point the next word does not exist, you have to go back to the previously selected word; e.g., after `ruled` is selected there is no word possibility that begins with `eci..`, so go back and select `rule`, then you can go forward and check possibilities with `d` (`decide`).
vol7ron
@vol7ron Feel free to code that solution. Let's assume there is no string `benefitaccounteditembed` or something like that in the source data. Is that `benefit`, `accounted`, `item`, `bed` or `benefit`, `account`, `edit`, `embed`? In any case, I cannot see a way to make sure *a priori* there won't be any problem inputs. However, this method will get you most of the way there with the rest being handled by Dave's intern.
Sinan Ünür
@Sinan, yes, I'm not downing the above; just saying that recursively iterating by section would be more accurate than going by the dictionary. As for your `benefitaccounteditembed` example (great example, btw), I would say the second would come first, unless you chained through all possibilities (very CPU intensive), and you would still need a mechanism that weighted the values/words. If you didn't want to weight the words individually, you could set up a scheme that checked the minimum length and chose the combos that had the largest minimum (or something silly like that).
vol7ron
@Sinan, as for coding it, I'd rather leave that to you :) On second thought: store all possibilities and let the user decide.
vol7ron
@Sinan: Was thinking about the problem a bit more (there are only 600 lines that need attention: a far cry from the original 2,000, so thank you). `benefitaccounteditembed` with a dictionary without tense conjugations yields "benefit account edit embed" and "benefit account ed it embed". The most likely answer is the phrase with the longest words. At this point it is academic, though. I combined your results with the Google "Did You Mean" technique to reduce the set to a manageable 600.
Dave Jarvis
@Dave Thanks for the feedback. I would have hoped for a better than 70% success rate, but, oh well. I saw your code golf question (and this really is not suitable for golfing), but I would recommend posing this as a general algorithm question. Maybe there is a more suitable dictionary than `words` for this purpose. My main purpose in writing this script was to see how well creating one giant regex out of the entire contents of `words` performed (i.e. using the least possible developer time).
Sinan Ünür
@Sinan: You're welcome. The success rate would be better when combined with a secondary routine that splits otherwise undecipherable column names by scanning forward through the string (and using the longest matching words).
Dave Jarvis
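
A minimal Perl sketch of the forward-scanning routine discussed in these comments: it tries longer prefixes first, backtracks when the remainder cannot be split, and ranks competing splits by vol7ron's largest-minimum-word heuristic. The dictionary path and the `desc`/`calc` seed words are assumptions:

#!/usr/bin/perl

use strict; use warnings;

# Seed schema abbreviations the system dictionary will not contain.
my %words = map { $_ => 1 } qw( desc calc );

open my $dict, '<', '/usr/share/dict/words' or die "words: $!";
while ( my $w = <$dict> ) {
    chomp $w;
    $words{ lc $w } = 1 if length $w >= 3;
}
close $dict;

# Return a reference to every complete split of $s into dictionary
# words, trying longer prefixes first and backtracking on dead ends.
sub splits {
    my ($s) = @_;
    return [ [] ] if $s eq '';
    my @found;
    for my $n ( reverse 3 .. length $s ) {
        my $prefix = substr $s, 0, $n;
        next unless $words{$prefix};
        push @found, map { [ $prefix, @$_ ] } @{ splits( substr $s, $n ) };
    }
    return \@found;
}

# Smallest word length in a split, used to rank alternatives.
sub min_len {
    my ($split) = @_;
    my ($min) = sort { $a <=> $b } map { length } @$split;
    return $min;
}

for my $id (qw( benefitactioncodedesc ageconrolagelimit )) {
    my @all = @{ splits($id) };
    if ( !@all ) {
        print "$id : *** no split\n";
        next;
    }
    # Prefer the split whose shortest word is longest.
    my ($best) = sort { min_len($b) <=> min_len($a) } @all;
    print "$id : @$best\n";
}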
A: 

I reduced your list to 32 atomic terms that I was concerned about and put them in longest-first arrangement in a regex:

use strict;
use warnings;

my $qr 
    = qr/ \G # right after last match
          ( distribution 
          | relationship 
          | beneficiary 
          | dependent 
          | subclass 
          | account
          | benefit 
          | address 
          | control 
          | history
          | percent 
          | action 
          | amount
          | conrol 
          | option 
          | period 
          | status 
          | class 
          | labor 
          | limit 
          | match 
          | notice
          | bank
          | code 
          | desc 
          | name 
          | role 
          | type 
          | age 
          | end 
          | pay
          | ps 
          )
    /x;

while ( <DATA> ) { 
    chomp;
    print;
    print ' -> ', join( ' ', m/$qr/g ), "\n";
}

__DATA__
payperiodmatchcode
labordistributioncodedesc
dependentrelationship
actionendoption
actionendoptiondesc
addresstype
addresstypedesc
historytype
psaddresstype
rolename
bankaccountstatus
bankaccountstatusdesc
bankaccounttype
bankaccounttypedesc
beneficiaryamount
beneficiaryclass
beneficiarypercent
benefitsubclass
beneficiaryclass
beneficiaryclassdesc
benefitactioncode
benefitactioncodedesc
benefitagecontrol
benefitagecontroldesc
ageconrolagelimit
ageconrolnoticeperiod
Axeman
@Axeman: Thanks. You only have a sample of the list, though. The actual list has 2,000 items.
Dave Jarvis
@Dave: If you have a data dictionary, one that allows "desc" to invariably stand for "description", then it is no problem. Since you don't give all 2,000 items (and thank you), we can't possibly tell whether your complete set has "false" matches to longer words in Sinan's dictionary example that you will have to tune out; a list of atomic terms is one way to control the process. You also can't tell how many terms you'll have to add, like `'desc'` and `'ps'`, so you'll end up maintaining the list anyway. You can add baroque cases to your code, or you can maintain a data dictionary.
Axeman
@Dave, in addition, I even added 'conrol' to the list, in case it meant anything.
Axeman
@Axeman: I was thinking about doing a simple search/replace on the major items (like "desc"). There is an intern who can do the rest of the work by hand. Like I said, just looking for a 90% solution to chop 11 hours down to 1 hour.
Dave Jarvis
+1  A: 

Two things occur to me:

  • this just isn't a task you can confidently attack programmatically, because English words don't work like that: they're often made of other words. Is a given string "reportage" or "report age"? "Timepiece" or "time piece"?
  • One way to attack the problem would be to use anag, which finds anagrams. After all, "time piece" is an anagram of "timepiece"; now you just have to weed out the false positives.
AmbroseChapel
@Ambrose: Not looking for perfection. Adding 6,000+ spaces to split words would take 11 hours. Reducing that to 1 hour is solving the problem.
Dave Jarvis
A: 

Here is a Lua program that tries longest matches from a dictionary:

-- load the system word list into a set
local W={}
for w in io.lines("/usr/share/dict/words") do
    W[w]=true
end

-- greedily take the longest dictionary prefix (at least 3 chars)
-- and recurse on the remainder; if the greedy choice leads to a
-- dead end, shorter prefixes are not retried
function split(s)
    for n=#s,3,-1 do
        local w=s:sub(1,n)
        if W[w] then return w,split(s:sub(n+1)) end
    end
end

-- reads identifiers from standard input, one per line
for s in io.lines() do
    print(s,"-->",split(s))
end
lhf