Hello,

I have a 'large' set of line-delimited sentences that I'm processing with Hadoop. I've developed a mapper that applies some of my favorite NLP techniques to each sentence. Since I'm mapping several different techniques over the original set of sentences, my goal in the reduce phase is to collect the results into groups such that all members of a group share the same original sentence.

I feel that using the entire sentence as a key is a bad idea. I also suspected that generating a hash value of each sentence might not work because of a limited number of possible keys (an admittedly unjustified belief).

Can anyone recommend a best practice for generating unique keys for each sentence? Ideally, I would like to preserve order, but that isn't a hard requirement.

Goodbye,

+1  A: 

Standard hashing should work fine. Most hash algorithms have a value space far greater than the number of sentences you're likely to be working with, and thus the likelihood of a collision will still be extremely low.
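
If it helps, here's a minimal Java sketch of that idea (the class and method names are just placeholders); it turns a sentence into a fixed-length hex string you can use as a reduce key:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SentenceHashing {

    // Hash a sentence into a fixed-length hex string, usable as a reduce key.
    public static String sha1Hex(String sentence) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(sentence.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 support is mandatory in every JDK, so this can't happen.
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Identical sentences always map to the same 40-character key.
        System.out.println(sha1Hex("The quick brown fox jumps over the lazy dog."));
    }
}
```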

Amber
Can you give me rough figures for the value space? I need the application to scale, and I'm worried about ending up with a solution that works in my testing environment but breaks later on.
gnucom
SHA-1 outputs 160-bit hashes, which gives a value space of 2^160 elements... I rather doubt you're going to have more sentences than, oh, 2^40 or so (that would be a terabyte of data for every character of average sentence length). Even allowing for the birthday paradox, the probability of any collision among 2^40 inputs in a 2^160 space is on the order of 2^-81.
Amber
A: 

You might want to avoid simple hash functions (for example, any half-baked idea you could think up quickly), because they may not mix up the sentence data enough to avoid collisions in the first place. One of the standard cryptographic hash functions, such as MD5, SHA-1, or SHA-256, would be quite suitable.

You can use MD5 for this even though collisions have been found and the algorithm is considered unsafe for security-critical purposes. This isn't a security-critical application, and the collisions that have been found arose from carefully constructed data; they are very unlikely to arise randomly in your own NLP sentence data. (See, for example, Johannes Schindelin's explanation of why it's probably unnecessary to change git to use SHA-256 hashes, to appreciate the reasoning behind this.)
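
As a rough sketch of how that might slot into a mapper (the class name NlpMapper and the applyTechnique stand-in are illustrative, not from any existing code):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Keys each NLP result by the MD5 of its source sentence, so the shuffle
// delivers every technique's output for the same sentence to one reduce call.
public class NlpMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String sentence = line.toString();
        String result = applyTechnique(sentence);
        context.write(new Text(md5Hex(sentence)), new Text(result));
    }

    // Stand-in for whatever NLP processing your real mapper performs.
    private String applyTechnique(String sentence) {
        return sentence.toUpperCase();
    }

    private static String md5Hex(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // MD5 is mandatory in every JDK
        }
    }
}
```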

Ken Bloom
+1  A: 

Despite my other answer about what a proper hash function might look like, I would really suggest you just use the sentences themselves as the keys, unless you have a specific reason why that would be a problem.
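
In that case the mapper needs no hashing step at all; here's a minimal sketch under the same illustrative names as before (applyTechnique is again a stand-in for your actual processing):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The sentence itself is the key: Hadoop compares Text keys byte-for-byte,
// so identical sentences group together and the keys stay human-readable.
public class SentenceKeyMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String sentence = line.toString();
        context.write(new Text(sentence), new Text(applyTechnique(sentence)));
    }

    // Stand-in for the real NLP processing.
    private String applyTechnique(String sentence) {
        return sentence.toUpperCase();
    }
}
```

Using the full sentence costs some extra bytes per record during the shuffle, but in exchange the reducer sees the original sentence directly instead of an opaque hash.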

Ken Bloom