ansaurus

Question

Comparing list of strings with an available dictionary/thesaurus

Answer 1

+1 A:

You could download a list of words from the web (say, one of the files mentioned here: http://www.outpost9.com/files/WordLists.html), then do something like this:

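// Read words from file.
string[] words = ReadFromFile();

// Key: a word's letters, lower-cased and sorted. Value: every word made of exactly those letters.
Dictionary<String, List<String>> permuteDict = new Dictionary<String, List<String>>(StringComparer.OrdinalIgnoreCase);

foreach (String word in words) {
    // Array.Sort sorts in place and returns void, so sort a char[] copy first.
    char[] letters = word.ToLowerInvariant().ToCharArray();
    Array.Sort(letters);
    String sortedWord = new String(letters);
    if (!permuteDict.ContainsKey(sortedWord)) {
        permuteDict[sortedWord] = new List<String>();
    }
    permuteDict[sortedWord].Add(word);
}

// To do a lookup, build the key the same way:

char[] lettersToLook = wordToLook.ToLowerInvariant().ToCharArray();
Array.Sort(lettersToLook);
String sortedWordToLook = new String(lettersToLook);

List<String> outWords;
if (permuteDict.TryGetValue(sortedWordToLook, out outWords)) {
    foreach (String outWord in outWords) {
        Console.WriteLine(outWord);
    }
}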
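
For anyone who wants something to paste and run, here is a minimal self-contained sketch of the same sorted-letters idea. The names `AnagramIndex`, `SortLetters`, `LoadWords`, `Build` and `Lookup` are made up for illustration, and `LoadWords` assumes the downloaded list is a plain-text file with one word per line, which may not hold for every file on that page.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class AnagramIndex
{
    // Canonical key: the word's letters, lower-cased and sorted.
    public static string SortLetters(string word)
    {
        char[] letters = word.ToLowerInvariant().ToCharArray();
        Array.Sort(letters);
        return new string(letters);
    }

    // Hypothetical loader: assumes one word per line; blank lines are skipped.
    public static IEnumerable<string> LoadWords(string path)
    {
        foreach (string line in File.ReadLines(path))
        {
            string word = line.Trim();
            if (word.Length > 0)
            {
                yield return word;
            }
        }
    }

    // Groups every word under its sorted-letters key.
    public static Dictionary<string, List<string>> Build(IEnumerable<string> words)
    {
        var index = new Dictionary<string, List<string>>(StringComparer.Ordinal);
        foreach (string word in words)
        {
            string key = SortLetters(word);
            List<string> bucket;
            if (!index.TryGetValue(key, out bucket))
            {
                bucket = new List<string>();
                index[key] = bucket;
            }
            bucket.Add(word);
        }
        return index;
    }

    // Returns all indexed words that use exactly the given letters (empty list if none).
    public static IList<string> Lookup(Dictionary<string, List<string>> index, string letters)
    {
        List<string> matches;
        return index.TryGetValue(SortLetters(letters), out matches)
            ? matches
            : new List<string>();
    }
}
```

Build the index once with `AnagramIndex.Build(AnagramIndex.LoadWords(path))` and reuse it; each lookup then costs one sort of the input plus one dictionary probe, regardless of how many permutations the input has.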

Moron 2010-02-11 23:46:12

Thanks. My main concern was where to get a list of words from (whether there is a ready resource available), preferably one that is a fairly extensive representation of the English language. But your code has also answered the follow-up question I'd have had, namely "so how do I use it?" Thanks.

Shaun 2010-02-11 23:51:46

Perhaps this will help: http://www.outpost9.com/files/WordLists.html

Moron 2010-02-11 23:57:21

+1 I'd go for this solution as it's likely to provide the best performance. I'd probably just stick each word in a HashSet<string>, though - since there's no 'value' here - just a set of words.

Andras Zoltan 2010-02-16 14:09:57

@Andras: Actually, each possible input could map to a list of words, like "integral", "triangle", etc., so we should be storing a list of words per key. I will change the code to reflect that.

Moron 2010-02-16 16:52:58

Answer 2

A:

You can also use Wiktionary. The MediaWiki API (Wiktionary runs on MediaWiki) allows you to query for a list of article titles. In Wiktionary, article titles are (among other things) word entries in the dictionary. The only catch is that foreign words are also in the dictionary, so you might get "incorrect" matches sometimes. Your user will also need internet access, of course. You can get help and info on the API at: http://en.wiktionary.org/w/api.php

Here's an example of your query URL:

http://en.wiktionary.org/w/api.php?action=query&format=xml&titles=dog|god|ogd|odg|gdo

This returns the following xml:

<?xml version="1.0"?>
<api>
  <query>
    <pages>
      <page ns="0" title="ogd" missing=""/>
      <page ns="0" title="odg" missing=""/>
      <page ns="0" title="gdo" missing=""/>
      <page pageid="24" ns="0" title="dog"/>
      <page pageid="5015" ns="0" title="god"/>
    </pages>
  </query>
</api>

In C#, you can then use System.Xml.XPath to get the parts you need (page items with pageid). Those are the "real words".

I wrote an implementation and tested it (using the simple "dog" example from above). It returned just "dog" and "god". You should test it more extensively.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;
using System.Text;
using System.Xml.XPath;

public static IEnumerable<string> FilterRealWords(IEnumerable<string> testWords)
{
    string baseUrl = "http://en.wiktionary.org/w/api.php?action=query&format=xml&titles=";
    // Escape each word so characters such as '&' or '#' cannot break the query string.
    string queryUrl = baseUrl + string.Join("|", testWords.Select(w => Uri.EscapeDataString(w)).ToArray());

    string rawXml;
    using (WebClient client = new WebClient())
    {
        client.Encoding = Encoding.UTF8; // this is very important or the text will be junk
        rawXml = client.DownloadString(queryUrl);
    }

    TextReader reader = new StringReader(rawXml);
    XPathDocument doc = new XPathDocument(reader);
    XPathNavigator nav = doc.CreateNavigator();
    XPathNodeIterator iter = nav.Select(@"//page");

    List<string> realWords = new List<string>();
    while (iter.MoveNext())
    {
        // if the pageid attribute has a value
        // add the article title to the list.
        if (!string.IsNullOrEmpty(iter.Current.GetAttribute("pageid", "")))
        {
            realWords.Add(iter.Current.GetAttribute("title", ""));
        }
    }

    return realWords;
}

Call it like this:

IEnumerable<string> input = new string[] { "dog", "god", "ogd", "odg", "gdo" };
IEnumerable<string> output = FilterRealWords(input);

I tried using LINQ to XML, but I'm not that familiar with it, so it was a pain and I gave up on it.
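
For comparison, a LINQ to XML version of the parsing step might look roughly like the sketch below. It only covers the XML handling (no web request), and `WiktionaryXml` and `ExistingTitles` are names invented for this example. It assumes the response has the same shape as the sample above: `page` elements with no XML namespace, where a `pageid` attribute marks an existing entry.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class WiktionaryXml
{
    // Returns the titles of <page> elements that carry a pageid attribute,
    // i.e. the titles that exist as entries (assumed response shape, see above).
    public static List<string> ExistingTitles(string rawXml)
    {
        XDocument doc = XDocument.Parse(rawXml);
        return doc.Descendants("page")
                  .Where(page => page.Attribute("pageid") != null)
                  .Select(page => (string)page.Attribute("title"))
                  .ToList(); // materialize the lazy query into a concrete list
    }
}
```

The `Where`/`Select` chain on its own yields a lazily evaluated `IEnumerable<string>` rather than a list; the final `ToList()` is what turns it into a concrete collection.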

Benny Jobigan 2010-02-15 11:52:29

I think that WCF with a WebHttpBinding should be used here for the web service call. It's pretty easy to do, and you would be able to get the result as a list of objects, which you could just then use LINQ-to-Objects on.

casperOne 2010-02-15 21:32:10

@casperOne. Ah, I never used WCF before, so I'm totally unfamiliar with it. WebClient and XPath were easy enough to do, however. I wrote the LINQ to XML first, basically using the same kind of logic as above, but the darned thing kept returning an ILinqQueryable or some other not-the-object-that-i-wanted thing. Is WCF easy to set up and use?

Benny Jobigan 2010-02-16 11:25:55
