Given my input data in (userid, itemid) format:

raw: {userid: bytearray,itemid: bytearray}

dump raw;
(A,1)
(A,2)
(A,4)
(A,5)
(B,2)
(B,3)
(B,5)
(C,1)
(C,5)

grpd = GROUP raw BY userid;

dump grpd;

(A,{(A,1),(A,2),(A,4),(A,5)})
(B,{(B,2),(B,3),(B,5)})
(C,{(C,1),(C,5)})

I'd like to generate all of the pairwise combinations (order not important) of the items within each group. I eventually intend to compute Jaccard similarity on the items in each group.

Ideally, the bigrams would be generated and then I'd FLATTEN the output to look like:

(A, (1,2))
(A, (1,4))
(A, (1,5))
(A, (2,4))
(A, (2,5))
(A, (4,5))
(B, (2,3))
(B, (2,5))
(B, (3,5))
(C, (1,5))

The letters A, B, and C, which represent the userid, are not really necessary in the output; I'm just showing them for illustration. From there, I would count the number of occurrences of each bigram in order to compute Jaccard similarity. I'd love to know if anyone else is using Pig for similar similarity calculations (sorry!) and has encountered this already.
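To make the counting step concrete, here is a rough Python sketch of what I have in mind, run outside of Pig on the sample groups above. It assumes the similarity I want is item-to-item Jaccard, i.e. (users with both items) / (users with either item); the function name `jaccard_by_pair` and the `groups` dict are just placeholders for illustration.

```python
from collections import Counter
from itertools import combinations

# Sample data copied from the grouped output above (userid -> items).
groups = {
    "A": [1, 2, 4, 5],
    "B": [2, 3, 5],
    "C": [1, 5],
}

def jaccard_by_pair(groups):
    """Return {(item_i, item_j): Jaccard similarity} for co-occurring items."""
    item_counts = Counter()  # number of users that have each item
    pair_counts = Counter()  # number of users that have both items of a pair
    for items in groups.values():
        unique = sorted(set(items))  # sort so each pair has one canonical order
        item_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    # |users(i) or users(j)| = n_i + n_j - n_ij
    return {
        (i, j): both / (item_counts[i] + item_counts[j] - both)
        for (i, j), both in pair_counts.items()
    }

scores = jaccard_by_pair(groups)
for pair in sorted(scores):
    print(pair, round(scores[pair], 3))
```

Only pairs that co-occur for at least one user get a score here; every other pair implicitly has similarity 0.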

I've looked at the NGramGenerator that's supplied with the Pig tutorials, but it doesn't really match what I'm trying to accomplish. I'm wondering if perhaps a Python streaming UDF is the way to go.
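For reference, this is roughly what I imagine the core of such a streaming script would look like. It is only a sketch under two assumptions that I have not verified against my setup: each input line arrives as `userid<TAB>itemid`, and all lines for a given userid are adjacent (for example, because the relation was ordered by userid first). The function name `pairs_from_lines` is made up.

```python
from itertools import combinations, groupby

def pairs_from_lines(lines):
    """Yield 'userid<TAB>item1<TAB>item2' for each unordered item pair per user.

    Assumes 'userid<TAB>itemid' input lines, with every line for a given
    userid adjacent to the others for that userid.
    """
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for userid, user_rows in groupby(rows, key=lambda row: row[0]):
        # Items are kept as strings, so the ordering is lexicographic.
        items = sorted({row[1] for row in user_rows})
        for first, second in combinations(items, 2):
            yield "\t".join((userid, first, second))

# Sample lines matching the raw relation above.
sample = [
    "A\t1\n", "A\t2\n", "A\t4\n", "A\t5\n",
    "B\t2\n", "B\t3\n", "B\t5\n",
    "C\t1\n", "C\t5\n",
]
for out in pairs_from_lines(sample):
    print(out)
```

In an actual streaming script the same function would be fed `sys.stdin` instead of `sample`, and the printed lines would be read back into Pig as (userid, item1, item2) tuples.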