Hello,
I have a 'large' set of line delimited full sentences that I'm processing with Hadoop. I've developed a mapper that applies some of my favorite NLP techniques to it. There are several different techniques that I'm mapping over the original set of sentences, and my goal during the reducing phase is to collect these results into groups such that all members in a group share the same original sentence.
I feel that using the entire sentence as a key is a bad idea. I felt that generating some hash value of the sentence may not work because of a limited number of keys (unjustified belief).
Can anyone recommend the best idea/practice for generating unique keys for each sentence? Ideally, I would like to preserve order. However, this isn't a main requirement.
Aντίο,