tags:

views:

90

answers:

4

I’m a student of other discipline. I would like to make an executable program on my own. Suppose the program I would like to make is a small dictionary with a few thousands of words. In the hope of creating such a program in my Windows XP, I collected a compiler called Borland C++ V.5.02. I downloaded some books on C language like Programming Language (2nd Edition) by Brian W. Kernighan, Dennis M. Ritchie; Sams Teach Yourself C in 24 Hours, and Programming with C (2nd edition) by Byron S. Gottfried. I started reading those books but sooner I found there were no instruction how to make such a program or I was unable to understand from those huge contents. I’m expecting some instruction from you all telling how I should proceed. Please leave some comments to help me out create such type of program.

+2  A: 

C is not the friendliest language to learn on your own if you have no notions of computer architecture.

Maybe something like Python is more suitable?

I'm sure you can download lots of Python books :-)

Mau
A: 

Programming is not the easiest thing to do, despite some people think so. It takes a lot of time to learn and master so if you really considering creating your own application a lot of patience and time is required. If you want to use this application of yours to learn programming then I'd suggest find some tutorials for particular language and digg in. If it's something you'd need for your main discipline maybe you could hire somebody to create such app for you, or browse sourceforge.net for similar solution, or find commercial alternative.

And yes, it's hard in the beginning :)

Greg
A: 

If you want to learn C++ then you can use Microsoft Visual C++ Express which is free. Creating of executable program is pretty straightforward task. Look at Visual C++ Guided Tour

DixonD
+1  A: 

Welcome to programming. :)

It might be easier to think of your problem in small pieces:

  • How will you store your dictionary?
    • Plain text, any order -- simple to work with, slower
    • Plain text, sorted -- requires sorting the word list (easy with a sort utility, but you or your users have to remember to sort the list); faster
    • A binary version (say, 32-byte 'records' for words): much harder to edit, but fast lookups
    • A binary encoding of a tree structure, child nodes are allowed character transitions: requires tools to create, very fast lookups
    • clever hashing http://en.wikipedia.org/wiki/Bloom_filters can go very very quickly.

Once you've picked the storage (I suggest plain text, any order, as a good starting point) you'll need to figure out the algorithm:

  • For a given word, compare it against every single word in the file
  • So you'll need to read ever line in the file, one at a time (fgets in a loop)
  • Compare the word with the line (strcmp)
  • return 1 if found
  • return 0 if you reach the end of the file

Now, iterate this, once for each word in the input:

  • read in a line (fgets)
  • tokenize the string into words (strtok)
  • strip off punctuation (or ignore? or ...)
  • pass the word to the routine you wrote earlier

This is an awful dictionary program: for a dictionary of 100,000 words (my /usr/share/dict/words is 98,000, and I think the wordlist on OpenBSD systems is in the 150,000 range) and a document of 5,000 words, you'll run your inner loop roughly 250,000,000 times. That'll be slow on even fast machines. Which is why I mentioned all those much-more-complex data structures earlier -- if you want it fast, you can't do naive.

If you sort the word list, it'll be roughly 83,000 comparisons in your inner-loop.

(And now a small diversion: the look program supports a -b option to ask for a binary search; without the -b, it runs a linear search:

         'help'  'universe' 'apple'
linear    .040s    .054s     .058s
binary    .001s    .001s     .001s

In other words, if you're going to be doing 5,000 of these, sorted word list will give you much faster run times.)

If you build a finite-state machine (the tree structure), it'll be as many comparisons as your input word has characters, times 5000. That'll be a huge savings.

If you build the bloom filters, it'll be computing one or two hashes (which is some simple arithmetic on your characters, very quick) and then one or two lookups. VERY fast.

I hope this is helpful, at least the simpler versions shouldn't be hard to implement.

sarnold