Hi,

In my C++ program...

The user types a string into the program, e.g. "foo".

I need to compare this string against the words stored in my txt files and report: this string is a noun! (or an adjective...)

I have a few TXT files - one file with nouns, a second file with adjectives... but each file contains about 200,000 words.

How can I efficiently compare this string "foo" with the strings in my files?

What should I use?

+13  A: 

Put your words in std::set<std::string> containers and do a lookup on them. This gives O(log n) time for an access, which is probably sufficient for what you are doing.

You can also use std::map<std::string, std::string> where the key is the word and the value is the class (e.g. "noun").
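For illustration, a minimal sketch of the map approach (the file names nouns.txt and adjectives.txt and the one-word-per-line format are assumptions):

```
#include <fstream>
#include <iostream>
#include <map>
#include <string>

// Load every word from `path` into `dict`, mapping it to `category`.
// Assumes one word per line; adjust the parsing if your files differ.
static void load_words(const std::string& path,
                       const std::string& category,
                       std::map<std::string, std::string>& dict)
{
    std::ifstream in(path.c_str());
    std::string word;
    while (in >> word)
        dict[word] = category;
}

int main()
{
    std::map<std::string, std::string> dict;
    load_words("nouns.txt", "noun", dict);           // assumed file name
    load_words("adjectives.txt", "adjective", dict); // assumed file name

    std::string input;
    while (std::cin >> input) {
        std::map<std::string, std::string>::const_iterator it = dict.find(input);
        if (it != dict.end())
            std::cout << input << " is a " << it->second << "\n";
        else
            std::cout << input << " is unknown\n";
    }
}
```

The files are loaded once at startup; each query is then a single O(log n) find on the in-memory map.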

Tronic
Do you think reading about 200,000 x 2 words into containers will be fast?
Kate83
@Kate: yes. 200k is nothing.
Andreas Bonini
std::map and std::set are highly optimized for lookup by key (they may use e.g. a red-black search tree internally), when you use c.find(key). Only a few comparisons will be required to find the correct node.
Tronic
Thank you, I will try this one second :)
Kate83
std::unordered_set or std::unordered_map may be better choices. Reading the words into the container should be OK whichever standard container you use, provided you don't reload the data for every search. The "perfect" data structure depends on usage - a trie (aka digital tree) is one option, a ternary tree is a little slower but more memory efficient, but the gains probably aren't sufficient to justify the development time.
Steve314
@Steve314: But those are nonstandard containers.
Billy ONeal
@Tronic: what algorithm is used within std::set and std::map? Are they binary search trees? (Just curious, since you mentioned O(log n) time.)
Bragboy
@Bragaadeesh - yes, they are balanced trees (usually red/black). @Steve314: unordered_set/unordered_map are in the TR1 namespace (or other nonstandard namespaces) since they are not yet part of the standard.
Joe
The standard doesn't specify which algorithm is used, it only specifies the efficiency (O(log n) for map/set and O(1) for unordered versions). The unordered versions may be directly in std namespace too, if the compiler is in C++0x mode (e.g. g++ -std=c++0x).
Tronic
@BillyONeal - your point being that I'm not allowed to discuss standard containers and alternative data structures in the same comment?
Steve314
@Steve314: No, perfectly allowed. But if the container is nonstandard, then that should be pointed out to the reader.
Billy ONeal
@BillyONeal - as I understand it, there are no standard library data structures - the constraints are strict enough that std::vector pretty much has to be an array, but std::map could be one of a number of balanced tree data structures. In any case, I thought "container" vs. "data structure" was a fairly clear terminology distinction. OTOH sometimes I try too hard to not be too pedantic, and assume things are obvious when they aren't. Maybe my bad - don't know.
Steve314
@Steve314: Yes, but when someone sees it in the `std` namespace one assumes it's part of the standard library, unless specified otherwise. If it's not there, then people come back and ask why their calls to `std::unordered_map` fail to compile.
Billy ONeal
@Steve314: BTW: I wasn't ripping on you (I upvoted the comment) but I wanted others reading this to know the whole story here ;)
Billy ONeal
I wanted to upvote this answer but I can't because of the last sentence. A `map<string, WordCategory>` would be so much more memory efficient, `WordCategory` being an enum.
Manuel
@Manuel: Or map<string, set<WordCategory> > for storing the several word categories to which one string may belong.
Daniel Daranas
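Putting the suggestions from these comments together - an unordered container keyed by word, with a set of category enums as the value - a minimal sketch might look like this (the container choice, enum, and file names are assumptions, and it requires C++0x/C++11 for std::unordered_map in the std namespace):

```
#include <fstream>
#include <iostream>
#include <set>
#include <string>
#include <unordered_map>

enum WordCategory { Noun, Adjective };

int main()
{
    // One word may belong to several categories, hence a set of categories per word.
    std::unordered_map<std::string, std::set<WordCategory>> dict;

    std::ifstream nouns("nouns.txt"), adjectives("adjectives.txt"); // assumed file names
    std::string w;
    while (nouns >> w)      dict[w].insert(Noun);
    while (adjectives >> w) dict[w].insert(Adjective);

    std::string input;
    while (std::cin >> input) {
        auto it = dict.find(input);   // average O(1) hash lookup
        if (it == dict.end()) { std::cout << input << ": unknown word\n"; continue; }
        for (WordCategory c : it->second)
            std::cout << input << ": " << (c == Noun ? "noun" : "adjective") << "\n";
    }
}
```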
A: 

Do you just need to confirm if it matches anything?

If so, use a Trie.

Soraz
I must tell the user that his word is a noun, an adjective... or that the program doesn't know the word.
Kate83
Then use two Tries, one for nouns and one for adjectives.
MSalters
+13  A: 

Use a trie data structure for this. You will need some extra memory to construct it, but lookups will be very efficient.
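A minimal sketch of a trie, under the simplifying assumption that words consist only of lowercase ASCII letters (nodes are never freed here, and one such trie would be built per category, e.g. nouns and adjectives):

```
#include <iostream>
#include <string>

// Fixed-alphabet trie node: one child per lowercase letter.
struct TrieNode {
    TrieNode* child[26];
    bool is_word;
    TrieNode() : is_word(false) {
        for (int i = 0; i < 26; ++i) child[i] = 0;
    }
};

void insert(TrieNode* root, const std::string& word)
{
    TrieNode* node = root;
    for (size_t i = 0; i < word.size(); ++i) {
        int c = word[i] - 'a';
        if (c < 0 || c >= 26) return;          // skip words outside the assumed alphabet
        if (!node->child[c]) node->child[c] = new TrieNode();
        node = node->child[c];
    }
    node->is_word = true;
}

bool contains(const TrieNode* root, const std::string& word)
{
    const TrieNode* node = root;
    for (size_t i = 0; i < word.size(); ++i) {
        int c = word[i] - 'a';
        if (c < 0 || c >= 26 || !node->child[c]) return false;
        node = node->child[c];
    }
    return node->is_word;
}

int main()
{
    TrieNode root;
    insert(&root, "cat");
    insert(&root, "car");
    std::cout << contains(&root, "cat") << " " << contains(&root, "dog") << "\n"; // prints: 1 0
}
```

Lookup cost is proportional to the length of the queried word, independent of how many words are stored.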

Bragboy
Thank you, I will try this one first :)
Kate83
OMG TRIE is brilliant. Sadly, this is something I may have reinvented if pressed hard enough for it.
Hamish Grubijan
I reinvented it in 1996 in C. The speed difference knocked my socks off (the PC was a 486). Very cool. It was first written about in the late 60s, I believe. I didn't know it was a real structure until I got curious a few years ago. If this is homework, you'd impress the teacher more than by using the built-in containers. If it's work, your coworkers would make fun of you for wasting time reinventing the wheel!
FastAl
+1  A: 

I would recommend using SQLite for your files instead.

You could compute a CRC of each word and store it as an integer key alongside the value in a table. Create an index on the key field.

When you want to do a lookup you can take the CRC of the word, and do a lookup in the table.
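A hedged sketch of the SQLite-table approach using the sqlite3 C API; for simplicity it indexes the word text directly instead of a CRC (which sidesteps the collision concern raised in the comments below), and it assumes a database file words.db that has already been populated with one row per word:

```
#include <iostream>
#include <string>
#include <sqlite3.h>

int main()
{
    sqlite3* db = 0;
    if (sqlite3_open("words.db", &db) != SQLITE_OK) return 1;   // assumed database file

    // One row per word; the index makes each lookup a B-tree search on disk.
    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS words(word TEXT, category TEXT);"
        "CREATE INDEX IF NOT EXISTS idx_word ON words(word);",
        0, 0, 0);
    // (Bulk-inserting the 200,000 words from the txt files is omitted here.)

    sqlite3_stmt* stmt = 0;
    sqlite3_prepare_v2(db, "SELECT category FROM words WHERE word = ?;", -1, &stmt, 0);

    std::string input;
    while (std::cin >> input) {
        sqlite3_bind_text(stmt, 1, input.c_str(), -1, SQLITE_TRANSIENT);
        if (sqlite3_step(stmt) == SQLITE_ROW)
            std::cout << input << " is a " << sqlite3_column_text(stmt, 0) << "\n";
        else
            std::cout << input << " is unknown\n";
        sqlite3_reset(stmt);
        sqlite3_clear_bindings(stmt);
    }

    sqlite3_finalize(stmt);
    sqlite3_close(db);
}
```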

Brian R. Bondy
Is the CRC for each word one-to-one? If not, the keys may collide, wouldn't they?
Bragboy
@Bragaadeesh: With only 200,000 keys you would have a better chance of winning the lottery. You could even use CRC-8 if you wanted. You could select all and do a string comparison if two matched, but two probably would never match.
Brian R. Bondy
Bad idea. With CRC-32, birthday collisions are likely at 2^16 = 65536 keys. With 200,000 keys, a collision is almost guaranteed. Yes, the chance of any one pair clashing is only 1 in 4 billion, but there are about 20,000,000,000 key pairs.
MSalters
+1  A: 

A radix tree will provide better memory usage for strings than a 'regular' trie if you have a lot of strings with common roots/prefixes (which is probably the case for a dictionary, i.e. words with many forms - although that probably depends on the language).

Eugen Constantin Dinca
A: 

You can store the external file indexed as a B-tree or as a chained hash table; this would provide really fast lookup times and a minimum of seeks to locate the data.

anijhaw