I was wondering what it takes to build a reverse language dictionary.

The user enters something along the lines of: "red edible fruit" and the application would return: "tomatoes, strawberries, ..."

I assume these results should be based on some form of keywords such as synonyms, or some form of string search.

This is an online implementation of this concept.

What's going on there and what is involved?

EDIT 1: The question is more about the "how" than the "which tool"; however, feel free to suggest any tools you think would do the job.

+4  A: 

OpenCyc is a computer-usable database of real-world concepts and meanings. From their web site:

OpenCyc is the open source version of the Cyc technology, the world's largest and most complete general knowledge base and commonsense reasoning engine. OpenCyc can be used as the basis of a wide variety of intelligent applications

Beware, though, that it's an enormously complex reasoning engine -- real-world facts were never simple. The documentation is quite sparse and the learning curve is steep.

intgr
Thank you, I've never heard of it before; I'll look into it. My question was more about the how than about which tool.
dassouki
A: 

It should be fairly straightforward. You can use straight synonyms in addition to a series of words to define each word. The word order in the definition is sometimes important. Each word can have multiple definitions, of course.

You can develop a rating system to see which definitions are the closest match to the input, then display the top 3 or 4 words.

xpda
So the hard part, in your opinion, is building the keyword engine?
dassouki
That would be really hard if you had to do it by hand. However, if you have access (and rights) to a dictionary, you can read it into a database and use that. That will take the most design work, though.
xpda
A: 

This sounds like a job for Prolog.

leppie
I would say it should be done with a computer ;)
Janusz
The question is more about the "how" than the "which tool"; thanks for the input, though.
dassouki
The tool would be a start....
leppie
+1  A: 

First, there must be some way of associating concepts (like 'snow') with particular words.

So rather than simply storing a wordlist, you would also need to store concepts or properties like "red", "fruit", and "edible" as well as the keywords themselves, and model relationships between them.

At a simple level, you could have two tables (they don't have to be database tables): a list of keywords, and a list of concepts/properties/adjectives. You then model the relationship by storing a third table that represents the mapping from keyword to adjective.

So if you have:

keywords:

0001  aardvark
....
0050  strawberry
....
0072  tomato
....
0120  zoo

and concepts:

0001  big
0002  small
0003  fruit
0004  vegetable
0005  mineral
0006  metal
....
0250  black
0251  blue
0252  red
....
0570  edible

you would need a mapping containing:

0050 -> 0003
0050 -> 0252
0050 -> 0570
0072 -> 0003
0072 -> 0252
0072 -> 0570

You may like to think of this as modelling an "is" relationship: 0050 (a strawberry) "is" 0003 (fruit), and "is" 0252 (red), and "is" 0570 (edible).
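
As a minimal sketch of the "is" mapping above (in Python; the dictionary contents, the extra "zoo" entry, and the function name `lookup` are illustrative assumptions, not a real data set), each keyword can be stored with its set of concepts, and a query keeps only the keywords whose concepts include every queried term:

```python
# Hypothetical in-memory version of the keyword -> concept mapping.
# The strawberry/tomato rows mirror the example; the "zoo" row is invented.
IS_A = {
    "strawberry": {"fruit", "red", "edible"},
    "tomato": {"fruit", "red", "edible"},
    "zoo": {"big"},
}


def lookup(query):
    """Return the keywords whose concept set contains every word in the query."""
    wanted = set(query.lower().split())
    return sorted(word for word, concepts in IS_A.items() if wanted <= concepts)


print(lookup("red edible fruit"))  # ['strawberry', 'tomato']
```

Requiring every query word to match is the strictest possible rule; a real application would probably rank partial matches instead of discarding them.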

Nick Dixon
Thank you. So the hard part is building those relationships, and in a way your suggestion amounts to building a search engine. One idea I had is to also store antonyms, so that if you're looking for something "sad", it would immediately exclude all the words related to the antonym of "sad".
dassouki
Antonyms carry their own challenges, since a word can have antonyms that aren't related to each other; e.g. bitter and sour are both antonyms of sweet. Similarly, sad and mad are both antonyms of happy, but what relation they have to each other is another question.
JB King
+3  A: 

Any approach would basically involve having a normalized database. Here is a basic example of what your database structure might look like:

// terms
+-------------------+
| id | name         |
| 1  | tomatoes     |
| 2  | strawberries |
| 3  | peaches      |
| 4  | plums        |
+-------------------+

// descriptions
+-------------------+
| id | name         |
| 1  | red          |
| 2  | edible       |
| 3  | fruit        |
| 4  | purple       |
| 5  | orange       |
+-------------------+

// connections
+-------------------------+
| terms_id | descript_id  |
| 1        | 1            |
| 1        | 2            |
| 1        | 3            |
| 2        | 1            |
| 2        | 2            |
| 2        | 3            |
| 3        | 1            |
| 3        | 2            |
| 3        | 5            |
| 4        | 1            |
| 4        | 2            |
| 4        | 4            |
+-------------------------+

This is a fairly basic setup, but it should give you an idea of how many-to-many relationships using a look-up table work within databases.

Your application would have to break the input string into words and normalize them, for example by stripping suffixes. The script would then query the connections table and return the results.
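
The lookup described above could be sketched roughly as follows, using Python's built-in sqlite3 module. The rows are copied from the example tables; the function names `normalize` and `reverse_lookup` are hypothetical, and the suffix handling is a deliberately crude stand-in for a real stemmer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE terms (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE descriptions (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE connections (terms_id INTEGER, descript_id INTEGER);
    INSERT INTO terms VALUES (1,'tomatoes'),(2,'strawberries'),(3,'peaches'),(4,'plums');
    INSERT INTO descriptions VALUES (1,'red'),(2,'edible'),(3,'fruit'),(4,'purple'),(5,'orange');
    INSERT INTO connections VALUES (1,1),(1,2),(1,3),(2,1),(2,2),(2,3),
                                   (3,1),(3,2),(3,5),(4,1),(4,2),(4,4);
""")


def normalize(text):
    """Lower-case and split the input; strip a trailing 's' (assumption: crude plural removal)."""
    words = text.lower().split()
    return [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words]


def reverse_lookup(text):
    """Return the terms connected to every description word in the input."""
    words = normalize(text)
    if not words:
        return []
    placeholders = ",".join("?" * len(words))
    sql = f"""
        SELECT t.name FROM terms t
        JOIN connections c ON c.terms_id = t.id
        JOIN descriptions d ON d.id = c.descript_id
        WHERE d.name IN ({placeholders})
        GROUP BY t.id
        HAVING COUNT(DISTINCT d.id) = ?
        ORDER BY t.name
    """
    # The final parameter is the number of distinct words that must all match.
    return [row[0] for row in conn.execute(sql, words + [len(set(words))])]


print(reverse_lookup("red edible fruits"))  # ['strawberries', 'tomatoes']
```

The `GROUP BY ... HAVING COUNT` clause is what turns the join into an "all of these descriptions" match; dropping the `HAVING` clause and ordering by the count instead would give a ranked list of partial matches.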

evolve
As I've said before, the daunting task is building a keyword engine for 50k+ words.
dassouki
A social method might be best: have users offer keywords, and then have moderators confirm them.
evolve
Yeah, it's all about social media. The problem with that is that I'd be riding on the dream that people will actually use the app.
dassouki
You have to use it first; you'll end up contributing a lot of the data yourself to start. You've got to start somewhere or you won't get anywhere.
evolve
+1  A: 

How will your engine know that

  • "An incredibly versatile ingredient, essential for any fridge chiller drawer. Whether used for salads, soups, sauces or just raw in sandwiches, make sure they are firm and a rich red colour when purchased",
  • "mildly acid red or yellow pulpy fruit eaten as a vegetable", and
  • "an American musician who is known for being the lead singer/drummer for the alternative rock band Sound of Urchin"

all map to the same original word? Natural language definitions are unstructured, so you can't store them in a normalized database. You can attempt to structure them by reducing them to an ontology, like Princeton's WordNet, but creating and using ontologies is an extremely difficult problem, the topic of PhD theses and well-funded advanced research.

Dustin Getz
That makes sense, but I guess the sentences you mentioned, although valid, fall a bit outside my scope. The same point could be made about vague descriptions, such as "big blue thing" (sky, sea, the monster from "Monsters vs. Aliens").
dassouki
+3  A: 

To answer the "how" part of your question, you could utilize human computation: There are hordes of bored teenagers with iPhones around the globe, so create a silly game whose byproduct is filling your database with facts -- to harness their brainpower for your purposes.

Sounds like an awkward concept at first, but look at this lecture on Human Computation for an example.

intgr
you're a genius
dassouki
This is highly dependent on bad-spelling teenagers AND on the hope that lots of people will download and use the app.
dassouki
Both of these issues are addressed in that presentation. "bad spelling teenagers" -- build the game such that the goal is validating others' factoids. "people will download and use the app" -- create a web-based game.
intgr
Have you ever heard of 20q.net? This is a perfect example of getting the masses to populate your database.
NickLarsen
A: 

What about using a dictionary and performing a full-text search over the definitions (after removing link words and articles, like 'and', 'or', ...), then returning the word with the best score (the highest number of matching words, or maybe a more complicated scoring method)?
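
A rough Python sketch of that idea, under stated assumptions: the mini-dictionary below is invented for illustration (only the tomato definition is one quoted elsewhere in this thread), the stop-word list is far from complete, and the score is simply the number of shared words. The names `tokens` and `best_matches` are hypothetical.

```python
import re
from collections import Counter

# Invented mini-dictionary; a real one would be loaded from a licensed source.
DEFINITIONS = {
    "tomato": ["mildly acid red or yellow pulpy fruit eaten as a vegetable"],
    "strawberry": ["sweet fleshy red fruit"],
    "sky": ["the atmosphere and outer space as viewed from the earth"],
}
# Tiny illustrative list of link words and articles to ignore.
STOP_WORDS = {"a", "an", "and", "as", "from", "of", "or", "the"}


def tokens(text):
    """Lower-case the text and return its set of words, minus stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}


def best_matches(query, top_n=3):
    """Rank dictionary words by how many query words appear in a definition."""
    wanted = tokens(query)
    scores = Counter()
    for word, definitions in DEFINITIONS.items():
        # A word may have several definitions; keep its best-scoring one.
        score = max(len(wanted & tokens(d)) for d in definitions)
        if score:
            scores[word] = score
    return scores.most_common(top_n)


print(best_matches("a red fruit eaten as a vegetable"))
# [('tomato', 4), ('strawberry', 2)]
```

Counting shared words treats every word as equally informative; weighting rare words more heavily (as TF-IDF does) would be one of the "more complicated" scoring methods.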

Adrien Plisson
That sounds great, but two different descriptions could lead to the same word.
dassouki
Yes, but there are a lot of words with multiple meanings, so you will always have multiple definitions that may lead to the same word...
Adrien Plisson