Hi All,

I would like to create an online dictionary application using Python (possibly with Django).

It will be similar to http://dictionary.reference.com/.

PS: The dictionary is not stored in a database. It is stored in a text file or a gzip-compressed file. Free English dictionary files can be downloaded from this URL: dicts.info/dictionaries.php.

The simplest free dictionary files are in this format:

word1 explanation for word1 

word2 explanation for word2 

There are some other formats as well, but all of them are stored as either a text file or a text.gz file.

My questions are:

(1) Are there any existing open-source Python packages, modules, or applications that implement this functionality, which I can use or learn from?

(2) If the answer to the first question is no, which algorithm should I follow to create such a web application? Can I simply use Python's built-in dictionary object for this job, so that the key is the English word and the value is the explanation? Is this OK in terms of performance? Or do I have to create my own tree object to speed up the search? Or is there an existing package that handles this job properly?

Thank you very much.

+3  A: 

I'm not sure which functionality you are talking about. If you mean searching for keywords in a dictionary that is stored in your database, then a Python dictionary is not a workable solution, as you would have to deserialize your whole database in order to run a search.

You should rather look at the Django search applications. A lot of people recommend Haystack:

http://stackoverflow.com/questions/55056/whats-the-best-django-search-app

and use this search engine to look for a keyword in your database.

If you don't need to support sophisticated searches, you could also query your database for an exact keyword:

DictEntry.objects.get(keyword='something').definition

I guess it all depends on the level of sophistication you want to achieve, but there can be extremely simple solutions.

EDIT:

If the dictionaries come from files, then it's hard to say; you have plenty of options.

If the file is small, you could indeed deserialize it into a dictionary when the server starts, and then always search in that same instance (so you wouldn't have to deserialize it again for each request).

If the files are really big, you could consider migrating them to your database.

1) First create your Django models, so you know what data you need, the names of your fields, and so on. For example:

from django.db import models

class DictEntry(models.Model):
    keyword = models.CharField(max_length=100)
    definition = models.TextField()  # definitions can be longer than 100 characters

2) It seems that some of the files at the link you gave are in CSV format (it also seems you can get them as XML). With the csv module from the standard library, you can read these files into Python.

3) Then, with the json or yaml Python libraries, you dump the data back out in a different format (JSON or YAML), as described in the Django documentation on providing initial data for models. And like magic, your initial data is ready!
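
Steps 2 and 3 could be sketched roughly as follows. This is only an illustration under assumptions: the file is taken to have two columns (keyword, definition), and the model label "dictionary.dictentry" is a made-up app name that you would replace with your own. The record layout ("model", "pk", "fields") is the one Django fixtures use.

```python
import csv
import io
import json

def csv_to_fixture(csv_file, model_label="dictionary.dictentry"):
    """Turn rows of (keyword, definition) into Django fixture records.

    model_label is an assumption: replace it with "<your_app>.dictentry".
    """
    records = []
    for pk, row in enumerate(csv.reader(csv_file), start=1):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        keyword, definition = row[0].strip(), row[1].strip()
        records.append({
            "model": model_label,
            "pk": pk,
            "fields": {"keyword": keyword, "definition": definition},
        })
    return records

# Small in-memory sample standing in for a downloaded dictionary file.
sample = io.StringIO(
    'apple,a round fruit\ndog,"a domesticated, four-legged animal"\n'
)
fixture = csv_to_fixture(sample)
print(json.dumps(fixture, indent=2))
```

If the JSON is saved to a file in your app's fixtures directory, it should be loadable with python manage.py loaddata followed by the fixture name. The csv module takes care of quoted fields, so commas inside a definition do not break the parsing.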

PS: The good thing about Python: if you google 'python json' you will find the official docs, because a library for reading and writing JSON is part of the standard Python library! The same goes for XML and CSV.

sebpiq
I am still new to Python and Django. Thanks a lot for the tips. :)
Dear sebpiq, can you please explain a little more about how to migrate them to a database? I am still new to Python. Thanks.
Thanks a lot for the additional reply about migrating to a database. :)
You're welcome :)
sebpiq
+1  A: 

A dictionary should be pretty small (by IT standards).

For performance, make sure that the dictionary is built in the module namespace:

Good:

 # build the dictionary once, when the module is imported
 # (dict_file is the path to your dictionary file)
 english_dict = {}
 with open(dict_file) as f:
     for line in f:
         # however you process the file:
         key, definition = line.rstrip('\n').split(',', 1)

         # put it in the dictionary
         english_dict[key] = definition

 def get_definition(word):
     # dict.get returns the default when the word is missing
     return english_dict.get(word, 'no definition')

Bad:

 def get_definition(word):

     # build the dictionary (rebuilt on every single call, which is slow)
     english_dict = {}
     with open(dict_file) as f:
         for line in f:
             # however you process the file:
             key, definition = line.rstrip('\n').split(',', 1)

             # put it in the dictionary
             english_dict[key] = definition

     return english_dict.get(word, 'no definition')

Or you could use pickle to save the dictionary (so it's faster to read in), or put it all in a database. It's up to you.
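
A minimal sketch of the pickle idea, under assumptions: the text file uses the "word explanation" format from the question (the word, one space, then the explanation), and the function name and paths are made up for the example. The first call parses the text and writes a cache; later calls just unpickle it.

```python
import os
import pickle
import tempfile

def load_dictionary(text_path, cache_path):
    """Return {word: definition}, using a pickle cache when one exists."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as cache:
            return pickle.load(cache)

    english_dict = {}
    with open(text_path, encoding="utf-8") as source:
        for line in source:
            line = line.strip()
            if not line:
                continue  # skip blank lines between entries
            # Assumed format: the word, one space, then the explanation.
            word, _, definition = line.partition(" ")
            english_dict[word] = definition

    with open(cache_path, "wb") as cache:
        pickle.dump(english_dict, cache, protocol=pickle.HIGHEST_PROTOCOL)
    return english_dict

# Demo with a throwaway file; real paths would point at the downloaded dictionary.
workdir = tempfile.mkdtemp()
text_path = os.path.join(workdir, "dict.txt")
cache_path = os.path.join(workdir, "dict.pickle")
with open(text_path, "w", encoding="utf-8") as f:
    f.write("apple a round fruit\n\ndog a domesticated animal\n")

first = load_dictionary(text_path, cache_path)   # parses the text, writes the cache
second = load_dictionary(text_path, cache_path)  # reads the pickle instead
print(second.get("dog", "no definition"))
```

Note that this sketch never refreshes the cache: if the text file changes, delete the pickle file so it gets rebuilt. Also, only unpickle files you created yourself, since loading an untrusted pickle can run arbitrary code.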

wisty
I am still new to Python and Django. Thanks a lot for the tips. :)
OK, a few quick tips then: dictionaries ({} or dict()) are really fast for look-ups. Lists ([] or list()) are slow to search, because every item may have to be checked. You need to know these two data structures. Putting things in the module namespace means that you build the dictionary once each time the Django process starts, which should be far less often than you call the function. Also, pickle is a good way to store Python objects on the hard drive.
wisty
Thanks for the new tips. Can I please ask one more question about word look-up? You mentioned that Python's built-in dictionaries are fast for look-ups. Is it sufficient and fast enough to use built-in dictionaries with pickle for the word look-up? Can it handle a large word database, e.g. 200 MB or more? Or is it better to use the Haystack + Xapian search engine mentioned by sebpiq for the look-up? I don't have much experience with this, which is why I am asking here and would like to know where to start. Picking the right direction will save a lot of time. ^_^ Thanks a lot.
+2  A: 

You might want to check out http://www.nltk.org/. You could get lots of words and their definitions without having to worry about the implementation details of a database. If you're new to all this, at the very least it would help you get up and running, and once you've got a working version, you can start adding a database.

Here's a quick snippet of how to get all the available meanings of "dog" from that package:

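# needs the WordNet data once: import nltk; nltk.download('wordnet')
from nltk.corpus import wordnet

for word_meaning in wordnet.synsets('dog'):
    # definition is a method in NLTK 3
    print(word_meaning.definition())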
Adam Morris
Thanks a lot for suggesting this Python module. Very interesting. I will have a look at it. I think I can check their source code to learn how they handle word look-up.