Can a human do it?
farsidebag
far sidebag
farside bag
far side bag
Not only do you have to use a dictionary, you might also have to use a statistical approach to figure out what's most likely (or, god forbid, an actual HMM for your human language of choice...).
For statistics that might be helpful, I refer you to Dr. Peter Norvig, who addresses a different but related problem, spell-checking, in 21 lines of code:
http://norvig.com/spell-correct.html
(He does cheat a bit by folding every for loop into a single line, but still.)
Update: This got stuck in my head, so I had to birth it today. This code does a similar split to the one described by Robert Gamble, but then it orders the results by word frequency in the provided dictionary file. That file is now expected to be some text representative of your domain, or of English in general. I used big.txt from Norvig, linked above, and appended (cat'ed) a dictionary to it to cover missing words.
A combination of two words will most of the time beat a combination of 3 words, unless the frequency difference is enormous.
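To make the two-versus-three-word claim concrete, here is a minimal sketch of the scoring idea with made-up numbers. The `%count` table, `$total`, and the helper name `seq_score` are hypothetical and chosen only for illustration; they are not figures from big.txt. The helper multiplies relative frequencies in the same spirit as the script's find_word_seq_score.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical word counts, invented for illustration only.
my %count = ( wicked => 50, weather => 200, wick => 10, ed => 30 );
my $total = 1_000_000;    # hypothetical corpus size

# Joint probability under a (false) independence assumption:
# multiply each word's relative frequency.
sub seq_score {
    my @words = @_;
    my $score = 1;
    $score *= $count{$_} / $total for @words;
    return $score;
}

my $two   = seq_score(qw(wicked weather));     # 50/1e6 * 200/1e6
my $three = seq_score(qw(wick ed weather));    # 10/1e6 * 30/1e6 * 200/1e6

printf "wicked weather : %.3e\n", $two;
printf "wick ed weather: %.3e\n", $three;
```

Every factor is far below 1, so each extra word shrinks the score considerably. With these invented counts the two-word split wins by roughly five orders of magnitude; the three-word split would only win if count(wick) * count(ed) exceeded count(wicked) * total.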
I posted the script listed under "Code:" below, with some minor changes, on my blog:
http://squarecog.wordpress.com/2008/10/19/splitting-words-joined-into-a-single-string/
I also wrote a little about the underflow bug in this script. I was tempted to just quietly fix it, but figured this may help some folks who haven't seen the log trick before:
http://squarecog.wordpress.com/2009/01/10/dealing-with-underflow-in-joint-probability-calculations/
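For readers who want the gist of the log trick without leaving the page, here is a minimal sketch. It is not the code from the blog post; the names `product_score` and `log_score`, the single-word `%count` table, and `$total` are hypothetical and exist only to provoke the underflow. The idea is that summing log-probabilities preserves the ordering of the products while staying well inside floating-point range.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical numbers for illustration: one rare word in a large corpus.
my %count = ( rare => 1 );
my $total = 1_000_000;

# Plain product of probabilities: underflows to 0 for long sequences.
sub product_score {
    my @words = @_;
    my $score = 1;
    $score *= $count{$_} / $total for @words;
    return $score;
}

# Sum of log-probabilities: same ordering as the product, no underflow.
# Assumes every word has a non-zero count (log of 0 is a fatal error in Perl).
sub log_score {
    my @words = @_;
    my $score = 0;
    $score += log( $count{$_} / $total ) for @words;
    return $score;
}

my @short = ('rare') x 60;    # true product is 1e-360, below double range
my @long  = ('rare') x 61;

printf "product: %g vs %g\n", product_score(@short), product_score(@long);
printf "log sum: %g vs %g\n", log_score(@short),     log_score(@long);
```

Both products collapse to 0 and can no longer be ranked, while the log sums remain distinct. Because log is monotonic, sorting by the log sum in descending order gives the same ranking the product would have given had it not underflowed.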
Output of the script on your words, plus a few of my own (notice what happens with "orcore"):
perl splitwords.pl big.txt words
answerveal: 2 possibilities
- answer veal
- answer ve al
wickedweather: 4 possibilities
- wicked weather
- wicked we at her
- wick ed weather
- wick ed we at her
liquidweather: 6 possibilities
- liquid weather
- liquid we at her
- li quid weather
- li quid we at her
- li qu id weather
- li qu id we at her
driveourtrucks: 1 possibilities
- drive our trucks
gocompact: 1 possibilities
- go compact
slimprojector: 2 possibilities
- slim projector
- slim project or
orcore: 3 possibilities
- or core
- or co re
- orc ore
Code:
#!/usr/bin/env perl
use strict;
use warnings;

# Forward declarations; the \@ prototypes pass arrays by reference.
sub find_matches($);
sub find_matches_rec($\@\@);
sub find_word_seq_score(@);
sub get_word_stats($);
sub print_results($@);
sub Usage();

our(%DICT,$TOTAL);    # token => count, and total number of tokens counted

{
    my( $dict_file, $word_file ) = @ARGV;
    ($dict_file && $word_file) or die(Usage);

    {
        my $DICT;
        ($DICT, $TOTAL) = get_word_stats($dict_file);
        %DICT = %$DICT;
    }

    {
        open( my $WORDS, '<', $word_file ) or die "unable to open $word_file: $!\n";
        foreach my $word (<$WORDS>) {
            chomp $word;
            my $arr = find_matches($word);

            local $_;
            # Schwartzian Transform: score each parse once, sort by score
            # (highest first), then strip the scores again.
            my @sorted_arr =
                map  { $_->[0] }
                sort { $b->[1] <=> $a->[1] }
                map  {
                    [ $_, find_word_seq_score(@$_) ]
                }
                @$arr;

            print_results( $word, @sorted_arr );
        }
        close $WORDS;
    }
}
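# Return every way to split $string into dictionary words,
# as a list (or array ref) of array refs of words.
sub find_matches($){
    my( $string ) = @_;

    my @found_parses;
    my @words;
    find_matches_rec( $string, @words, @found_parses );

    return @found_parses if wantarray;
    return \@found_parses;
}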
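# Recursive helper: try each dictionary word that is a prefix of $string,
# then recurse on the remainder. Complete parses accumulate in $found_parses.
sub find_matches_rec($\@\@){
    my( $string, $words_sofar, $found_parses ) = @_;
    my $length = length $string;

    # Nothing left to split: the words collected so far form a full parse.
    unless( $length ){
        push @$found_parses, $words_sofar;

        return @$found_parses if wantarray;
        return $found_parses;
    }

    # Prefixes start at length 2, so one-letter words are never matched.
    foreach my $i ( 2..$length ){
        my $prefix = substr($string, 0, $i);
        my $suffix = substr($string, $i, $length-$i);

        if( exists $DICT{$prefix} ){
            my @words = ( @$words_sofar, $prefix );
            find_matches_rec( $suffix, @words, @$found_parses );
        }
    }

    return @$found_parses if wantarray;
    return $found_parses;
}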
## Just a simple joint probability
## assumes independence between words, which is obviously untrue
## that's why this is broken out -- feel free to add better brains
sub find_word_seq_score(@){
    my( @words ) = @_;
    local $_;

    my $score = 1;
    foreach ( @words ){
        # NOTE: multiplying many small probabilities can underflow to 0;
        # see the underflow write-up linked above.
        $score = $score * $DICT{$_} / $TOTAL;
    }

    return $score;
}
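# Read the whole corpus and count how often each token occurs.
# Returns ( \%counts, $total_tokens ).
sub get_word_stats($){
    my ($filename) = @_;

    open(my $DICT, '<', $filename) or die "unable to open $filename: $!\n";

    local $/= undef;    # slurp mode: read the file in one go
    local $_;
    my %dict;
    my $total = 0;

    while ( <$DICT> ){
        # Splitting on \b also yields whitespace and punctuation tokens,
        # so those are counted in $total too. Counting is case-sensitive.
        foreach ( split(/\b/, $_) ) {
            $dict{$_} += 1;
            $total++;
        }
    }

    close $DICT;

    return (\%dict, $total);
}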
sub print_results($@){
    #( 'word', [qw'test one'], [qw'test two'], ... )
    my ($word, @combos) = @_;
    local $_;

    my $possible = scalar @combos;
    print "$word: $possible possibilities\n";

    foreach (@combos) {
        print ' - ', join(' ', @$_), "\n";
    }
    print "\n";
}
sub Usage(){
    # Trailing newline keeps die from appending "at ... line N."
    return "Usage: $0 /path/to/dictionary /path/to/your_words\n";
}