views:

521

answers:

2

For some game where one would need to find anagrams from a bunch of loose letters I ended up implementing a permutation algorithm to find all possible anagrams and filter those if needed for known letter positions (-match is great, by the way). But for longer words this proved very much error-prone, as skimming a large list of gibberish doesn't really reveal the proper words that were hidden within.

So I thought that if I would have a large list of English words (should be obtainable somewhere) I could just intersect my list of permutations with the list of proper words and get (hopefully) all real words from the permutation list.

Since many operators in PS work differently with collections I thought I could just do something like

$wordlist -contains $permlist

and get the intersection back. Unfortunately it's not that easy. Other options I have thought of would be to iterate over one list and do a -contains for each item:

$permlist | ? { $wordlist -contains $_ }

This probably would work but is also very slow, I think (especially when $wordlist is the result of a gc wordlist.txt). Or I could build a gigantic regular expression:

$wordlist -matches (($permlist | %{ "^$_`$" }) -join "|")

But that would probably not be very fast either. I could maybe also use findstr with above gigantic regex but that feels just wrong.

Are there any built-in solutions I could use and that are better than my attempts so far? Otherwise I'd probably put the word list into a hashtable and use the iterative -contains approach which should be fast enough then.

A: 

You could spell-check your list of words and eliminate all spelling errors against a standard dictionary.

With the GNU aspell package installed,

 cat text.txt | aspell list

will give you a list of all the miss-spelled words.
You can work with other dictionaries with aspell.


Or just pickup an anagram generator like this one made for Scrabble players.

The Revolution Word Finder has two options; an Anagram Finder and a Scrabble Solver. The Anagram Finder takes a list of letters and returns all the valid anagrams which can be created using them relative to a fixed list of words. Each anagram is checked for validity against the SOWPODS word list which is the word list used in current International Scrabble Tournaments.

nik
"You could spell-check your list of words and eliminate all spelling errors against a standard dictionary." Well, that's exactly what I was trying. However that doesn't tell me anything about how exactly to achieve this, sidestepping my question at least partially.
Joey
Sorry, I did not mean to side step your spell check point, have added a reference on what I meant. I was saying you have standard tools to get the match-list worked out.
nik
Hmm, right, though that's not very much a "Powershell built-in" solution. I could probably also coerce the Office spell checker to work but that's probably beyond what I'd be willing to do for that. Also giving me a list of mis-spelled words won't help me as I'd rather need a list of correctly-spelled words :) (The game in question is http://www.kongregate.com/games/Morpheme/blocks-with-letters-on and in some levels I struggled to even find the word I needed to construct, that's why I settled on brute-forcing every permutation and looking for words in the resulting list.
Joey
Your requirement is very close to Scrabble needs. And, once you get the miss-spelled words, inverting the result from the data set should not be very tough -- your critical differentiation is done. Meanwhile, I have not checked if `aspell` can give you the list of correctly spelled words -- that is not very likely.
nik
About the built-in need. Why try complicated algorithms in shell when you have packaged solution...
nik
Well, inverting the result would leave me with nearly exactly the same problem again :-). And well, why install something if you can have it in a few lines? I could build it all around the .NET HashSet type and be done with it, but maybe someone already came up with a good solution for arrays, as those are Powershell's default collections.
Joey
+3  A: 
$left = New-HashSet string
$left.Add("foo")
$left.Add("bar")
$right = New-HashSet string
$right.Add("bar")
$right.Add("baz")

$left.IntersectWith($right)
$left.UnionWith($right)

(borrowing New-HashSet from Josh Einstein)

Warning: those methods on HashSet are in-place algorithms that modify the original collection. If you want functional-style transform on immutable objects, you'll need to bring LINQ to the party:

add-type system.core

$asqueryable = [system.linq.queryable].getmethods() | ? { $_.name -eq "AsQueryable" } | select -first 1
$asqueryable = $asqueryable.MakeGenericMethod([string])
$leftAsQueryable = $asqueryable.Invoke($null, (,$left))

$intersect = [system.linq.queryable].getmethods() | ? { $_.name -eq "Intersect" } | select -first 1
$intersect = $intersect.MakeGenericMethod([string])
$result = $intersect.Invoke($null, ($leftAsQueryable, $right))

Clearly, someone needs to wrap this static-generic-reflection crap into a friendly cmdlet! Don't worry, I'm working on it...

Richard Berg
Ok, that would have been about my approach. Definitely not pretty. (And definitely not well-suited for unwrapped use from the cmdline itself).
Joey