Hi,

In my C++ program...

The user types a string into the program, e.g. "foo".

I need to compare this string against the words stored in my txt files and report: this string is a noun! (or an adjective...)

I have a few TXT files - one file with nouns, a second file with adjectives... but each file contains about 200,000 words.

How can I efficiently compare this string "foo" with the strings in my files?

What should I use?

+13  A: 

Put your words in std::set<std::string> containers and do a lookup on them. This gives O(log n) time for an access, which is probably sufficient for what you are doing.

You can also use std::map<std::string, std::string> where the key is the word and the value is the class (e.g. "noun").
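For illustration, a minimal sketch of the map approach (the file names nouns.txt and adjectives.txt and the one-word-per-line format are assumptions):

```
#include <fstream>
#include <iostream>
#include <map>
#include <string>

// Load every word from `path` into `dict`, mapping it to `category`.
// Assumes one word per line; adjust the parsing if your files differ.
static void load_words(const std::string& path,
                       const std::string& category,
                       std::map<std::string, std::string>& dict)
{
    std::ifstream in(path.c_str());
    std::string word;
    while (in >> word)
        dict[word] = category;
}

int main()
{
    std::map<std::string, std::string> dict;
    load_words("nouns.txt", "noun", dict);           // assumed file name
    load_words("adjectives.txt", "adjective", dict); // assumed file name

    std::string input;
    while (std::cin >> input) {
        std::map<std::string, std::string>::const_iterator it = dict.find(input);
        if (it != dict.end())
            std::cout << input << " is a " << it->second << "\n";
        else
            std::cout << input << " is unknown\n";
    }
}
```

The files are loaded once at startup; each query is then a single O(log n) find on the in-memory map.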

Tronic
Do you think reading about 200,000 x 2 words into containers will be fast?
Kate83
@Kate: yes. 200k is nothing.
Andreas Bonini
std::map and std::set are highly optimized for lookup by key (they may use e.g. a red-black search tree internally), when you use c.find(key). Only a few comparisons will be required to find the correct node.
Tronic
Thank you, I will try this one second :)
Kate83
std::unordered_set or std::unordered_map may be better choices. Reading the words into the container should be OK whichever standard container you use, provided you don't reload the data for every search. The "perfect" data structure depends on usage - a trie (aka digital tree) is one option, a ternary tree is a little slower but more memory efficient, but the gains probably aren't sufficient to justify the development time.
Steve314
@Steve314: But those are nonstandard containers.
Billy ONeal
@Tronic: what algorithm is used within std::set and std::map? Are they binary search trees? (Just curious, since you mentioned O(log n) time.)
Bragboy
@Bragaadeesh - yes, they are balanced trees (usually red/black). @Steve314: unordered_set/unordered_map are in the TR1 namespace (or other nonstandard namespaces) since they are not yet part of the standard.
Joe
The standard doesn't specify which algorithm is used, it only specifies the efficiency (O(log n) for map/set and O(1) for unordered versions). The unordered versions may be directly in std namespace too, if the compiler is in C++0x mode (e.g. g++ -std=c++0x).
Tronic
@BillyONeal - your point being that I'm not allowed to discuss standard containers and alternative data structures in the same comment?
Steve314
@Steve314: No, perfectly allowed. But if the container is nonstandard, then that should be pointed out to the reader.
Billy ONeal
@BillyONeal - as I understand it, there are no standard library data structures - the constraints are strict enough that std::vector pretty much has to be an array, but std::map could be one of a number of balanced tree data structures. In any case, I thought "container" vs. "data structure" was a fairly clear terminology distinction. OTOH sometimes I try too hard to not be too pedantic, and assume things are obvious when they aren't. Maybe my bad - don't know.
Steve314
@Steve314: Yes, but when someone sees it in the `std` namespace one assumes it's part of the standard library, unless specified otherwise. If it's not there, then people come back and ask why their calls to `std::unordered_map` fail to compile.
Billy ONeal
@Steve314: BTW: I wasn't ripping on you (I upvoted the comment) but I wanted others reading this to know the whole story here ;)
Billy ONeal
I wanted to upvote this answer but I can't because of the last sentence. A `map<string, WordCategory>` would be so much more memory efficient, `WordCategory` being an enum.
Manuel
@Manuel: Or map<string, set<WordCategory> > for storing the several word categories to which one string may belong.
Daniel Daranas
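Putting the suggestions from these comments together - an unordered container keyed by word, with a set of category enums as the value - a minimal sketch might look like this (the container choice, enum, and file names are assumptions, and it requires C++0x/C++11 for std::unordered_map in the std namespace):

```
#include <fstream>
#include <iostream>
#include <set>
#include <string>
#include <unordered_map>

enum WordCategory { Noun, Adjective };

int main()
{
    // One word may belong to several categories, hence a set of categories per word.
    std::unordered_map<std::string, std::set<WordCategory>> dict;

    std::ifstream nouns("nouns.txt"), adjectives("adjectives.txt"); // assumed file names
    std::string w;
    while (nouns >> w)      dict[w].insert(Noun);
    while (adjectives >> w) dict[w].insert(Adjective);

    std::string input;
    while (std::cin >> input) {
        auto it = dict.find(input);   // average O(1) hash lookup
        if (it == dict.end()) { std::cout << input << ": unknown word\n"; continue; }
        for (WordCategory c : it->second)
            std::cout << input << ": " << (c == Noun ? "noun" : "adjective") << "\n";
    }
}
```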
A: 

Do you just need to confirm if it matches anything?

If so, use a Trie.

Soraz
I must tell the user that his word is a noun, an adjective... or that the program doesn't know the word.
Kate83
Then use two Tries, one for nouns and one for adjectives.
MSalters
+13  A: 

Use a trie data structure for this. You will need some extra memory to construct it, but lookups will be very efficient.
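A minimal sketch of a trie, under the simplifying assumption that words consist only of lowercase ASCII letters (nodes are never freed here, and one such trie would be built per category, e.g. nouns and adjectives):

```
#include <iostream>
#include <string>

// Fixed-alphabet trie node: one child per lowercase letter.
struct TrieNode {
    TrieNode* child[26];
    bool is_word;
    TrieNode() : is_word(false) {
        for (int i = 0; i < 26; ++i) child[i] = 0;
    }
};

void insert(TrieNode* root, const std::string& word)
{
    TrieNode* node = root;
    for (size_t i = 0; i < word.size(); ++i) {
        int c = word[i] - 'a';
        if (c < 0 || c >= 26) return;          // skip words outside the assumed alphabet
        if (!node->child[c]) node->child[c] = new TrieNode();
        node = node->child[c];
    }
    node->is_word = true;
}

bool contains(const TrieNode* root, const std::string& word)
{
    const TrieNode* node = root;
    for (size_t i = 0; i < word.size(); ++i) {
        int c = word[i] - 'a';
        if (c < 0 || c >= 26 || !node->child[c]) return false;
        node = node->child[c];
    }
    return node->is_word;
}

int main()
{
    TrieNode root;
    insert(&root, "cat");
    insert(&root, "car");
    std::cout << contains(&root, "cat") << " " << contains(&root, "dog") << "\n"; // prints: 1 0
}
```

Lookup cost is proportional to the length of the queried word, independent of how many words are stored.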

Bragboy
Thank you, I will try this one first :)
Kate83
OMG TRIE is brilliant. Sadly, this is something I may have reinvented if pressed hard enough for it.
Hamish Grubijan
I reinvented it in 1996 in C. The speed difference knocked my socks off (the PC was a 486). Very cool. It was first written about in the late 60s, I believe. I didn't know it was a real structure until I got curious a few years ago. If this is homework, you'd impress the teacher more than by using the built-in containers. If it's work, your coworkers would make fun of you for wasting time reinventing the wheel!
FastAl
+1  A: 

I would recommend using SQLite for your files instead.

You could compute a CRC of each word and store it as an integer key alongside the value in a table. Create an index on the key field.

When you want to do a lookup you can take the CRC of the word, and do a lookup in the table.
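A hedged sketch of the SQLite-table approach using the sqlite3 C API; for simplicity it indexes the word text directly instead of a CRC (which sidesteps the collision concern raised in the comments below), and it assumes a database file words.db that has already been populated with one row per word:

```
#include <iostream>
#include <string>
#include <sqlite3.h>

int main()
{
    sqlite3* db = 0;
    if (sqlite3_open("words.db", &db) != SQLITE_OK) return 1;   // assumed database file

    // One row per word; the index makes each lookup a B-tree search on disk.
    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS words(word TEXT, category TEXT);"
        "CREATE INDEX IF NOT EXISTS idx_word ON words(word);",
        0, 0, 0);
    // (Bulk-inserting the 200,000 words from the txt files is omitted here.)

    sqlite3_stmt* stmt = 0;
    sqlite3_prepare_v2(db, "SELECT category FROM words WHERE word = ?;", -1, &stmt, 0);

    std::string input;
    while (std::cin >> input) {
        sqlite3_bind_text(stmt, 1, input.c_str(), -1, SQLITE_TRANSIENT);
        if (sqlite3_step(stmt) == SQLITE_ROW)
            std::cout << input << " is a " << sqlite3_column_text(stmt, 0) << "\n";
        else
            std::cout << input << " is unknown\n";
        sqlite3_reset(stmt);
        sqlite3_clear_bindings(stmt);
    }

    sqlite3_finalize(stmt);
    sqlite3_close(db);
}
```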

Brian R. Bondy
Is the CRC for each word one-to-one? If not, the keys may collide, wouldn't they?
Bragboy
@Bragaadeesh: With only 200,000 keys you would have a better chance of winning the lottery. You could even use CRC-8 if you wanted. You could select all and do a string comparison if two matched, but two probably would never match.
Brian R. Bondy
Bad idea. With CRC-32, birthday collisions are likely at 2^16 = 65536 keys. With 200,000 keys, a collision is almost guaranteed. Yes, the chance of any one pair clashing is only 1 in 4 billion, but there are about 20,000,000,000 key pairs.
MSalters
+1  A: 

A radix tree will provide better memory usage for strings than a 'regular' trie if you have a lot of strings with common roots/prefixes (which is probably the case for a dictionary, i.e. words with many forms - although that probably depends on the language).

Eugen Constantin Dinca
A: 

You can store the external file indexed as a B-tree or as a chained hash table; this would provide really fast lookup times and a minimum of seeks to locate the data.

anijhaw