views: 431
answers: 7

In order to detect RTs (retweets) of a particular tweet, I plan to store hashes of each formatted tweet in the database.

What hashing algorithm should I use? Cryptographic strength is of course not essential. I just need a minimal way of storing the data as something that can then be compared for equality efficiently.

My first attempt at this was with MD5 hashes, but I figured there may be hashing algorithms that are much more efficient, since security is not required.
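A minimal sketch of the idea, assuming a hypothetical normalization step before hashing (the cleanup rules here are just an illustration):

import hashlib
import re

def tweet_hash(tweet):
    # Hypothetical normalizer: drop RT/via markers, @mentions and
    # punctuation, fold case, collapse whitespace, then hash.
    text = re.sub(r'\b(RT|via)\b', ' ', tweet, flags=re.IGNORECASE)
    text = re.sub(r'@\w+', ' ', text)
    text = re.sub(r'[^\w\s]', ' ', text).lower()
    text = ' '.join(text.split())
    return hashlib.md5(text.encode('utf-8')).hexdigest()

print(tweet_hash('RT @alice: Hello, world!'))
print(tweet_hash('Hello world'))  # same hash after normalization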

A: 

Python's shelve module? http://docs.python.org/library/shelve.html
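A minimal sketch, assuming the hash itself is computed elsewhere; shelve gives you a persistent dict-like store keyed by string:

import shelve

tweet_hash = 'acbd18db4cc2f85cedef654fccc4a4d8'  # placeholder value

db = shelve.open('seen_tweets')   # disk-backed dict
if tweet_hash in db:
    print('duplicate')
else:
    db[tweet_hash] = True
db.close()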

Dave
+6  A: 

Do you really need to hash at all? Twitter messages are short enough (and disk space cheap enough) that it may be better to just store the whole message, rather than eating up clock cycles to hash it.

Chris Upchurch
Well, it would be computationally expensive to compare a given 140-char string with thousands of such strings. I figured querying the db with count(hash) is simpler and more efficient. Correct me if I am wrong.
Lakshman Prasad
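For illustration, the count(hash) lookup described above might look like this with sqlite3 (the table and column names are assumptions):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE tweets (hash TEXT)')
conn.execute('CREATE INDEX idx_hash ON tweets (hash)')  # O(log n) lookups
conn.execute("INSERT INTO tweets VALUES ('abc123')")

count, = conn.execute('SELECT count(hash) FROM tweets WHERE hash = ?',
                      ('abc123',)).fetchone()
print('duplicate' if count else 'new tweet')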
If you always sort your tweets and use binary search it could be doable. If your database is really huge, use radix search. (Linear run-time, how cool is that?)
Georg
Retweets are frequently non-identical. A hash would be oblivious to this unless you run some kind of "normalizer" first.
crunchyt
+1  A: 

Well, tweets are only 140 characters long, so you could even store the entire tweet in the database...

but if you really want to "hash" them somehow, a simple way would be to just take the sum of the ASCII values of all the characters in the tweet:

sum(ord(c) for c in tweet)

Of course, whenever you have a match of hashes, you should check the tweets themselves for sameness, because the probability of two different tweets producing the same "sum-hash" is non-negligible.

David Zaslavsky
Is there instead a simple hash which gives the correct answer _almost always_?
Lakshman Prasad
+2  A: 

I echo Chris' comment about not using a hash at all (your database engine can hopefully index 140-character fields efficiently).

If you did want to use a hash, MD5 would be my first choice as well (16 bytes), followed by SHA-1 (20 bytes).
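For reference, both are available in the standard library's hashlib, with the digest sizes stated above:

import hashlib

msg = 'There once was a man named Michael Finnegan.'.encode('utf-8')
print(len(hashlib.md5(msg).digest()))   # 16 bytes
print(len(hashlib.sha1(msg).digest()))  # 20 bytes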

Whatever you do, don't use sum-of-characters. I can't immediately come up with a function that would have more collisions (all anagrams hash the same), plus it's slower!

$ python -m timeit -s 'from hashlib import md5' 'd=md5("There once was a man named Michael Finnegan.").digest()'
100000 loops, best of 3: 2.47 usec per loop
$ python -m timeit 'd=sum(ord(c) for c in "There once was a man named Michael Finnegan.")'
100000 loops, best of 3: 13.9 usec per loop
Tim Hatch
You're right that "sum" is a terrible hashcode. But 140*255 is 35700, and on my system that only takes 16 bits to store ;-)
Steve Jessop
Right you are, fingers moved a little faster than brain there.
Tim Hatch
+4  A: 

I am not familiar with Python (sorry, Ruby guy typing here), but you could try a few things.

Assumptions: you will likely be storing hundreds of thousands of tweets over time, so comparing one hash against every record in the table will be inefficient. Also, RTs are not always carbon copies of the original tweet; after all, the original author's name is usually included and takes up some of the 140-character limit. So perhaps you could use a solution that matches more accurately than a "dumb" hash?

  1. Tagging & Indexing

    Tag and index the component parts of the message in a standard way. This could include treating hashtags (#....), mentions (@....) and URL strings as "tags". After removing noise words and punctuation, you could also treat the remaining words as tags.

  2. Fast Searching

    Databases are terrible at finding multiple group membership quickly (I'll assume you're using either MySQL or PostgreSQL, which are terrible at this). Instead, try one of the free-text engines like Sphinx Search. They are very, very fast at resolving multiple group membership (i.e. checking whether keywords are present).

    Using Sphinx or similar, we search on all of the "tags" we extracted. This will probably return a smallish result set of "potential original tweets". Then compare them one by one using a similarity-matching algorithm (here is one in Python: http://code.google.com/p/pylevenshtein/); a sketch of both steps follows below.
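A rough sketch of both steps (the noise-word list and regexes are assumptions, and the pure-Python edit distance stands in for the C extension linked above):

import re

NOISE_WORDS = {'a', 'an', 'and', 'is', 'of', 'rt', 'the', 'to', 'via'}

def extract_tags(tweet):
    # Step 1: pull out URLs, hashtags, @mentions and remaining keywords.
    text = tweet.lower()
    urls = re.findall(r'https?://\S+', text)
    text = re.sub(r'https?://\S+', ' ', text)
    hashtags = re.findall(r'#\w+', text)
    mentions = re.findall(r'@\w+', text)
    words = set(re.findall(r'[a-z0-9]+', re.sub(r'[#@]\w+', ' ', text)))
    return set(urls) | set(hashtags) | set(mentions) | (words - NOISE_WORDS)

def levenshtein(a, b):
    # Step 2: classic dynamic-programming edit distance for the final
    # one-by-one comparison of candidate originals.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(extract_tags('RT @bob: big news! http://example.com #python'))
print(levenshtein('kitten', 'sitting'))  # 3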

Now let me warmly welcome you to the world of text mining.

Good luck!

crunchyt
Of course, I have to "clean the tweet" of all the @words and punctuation. But rather than tagging and grouping, wouldn't it be simpler to generate some unique value that I can query the database for as count(hash)?
Lakshman Prasad
Have you analyzed a sample of RTs and confirmed they are mostly identical? If you can rely on this, a hash will be simpler. But my swift wild-arsed guess is maybe 10-20% of RTs are non-identical to the original. If you need high accuracy, then get a meaningful random sample (1,000-10,000) of tweets that look like RTs (i.e. start with "RT @....", "via @....", "Retweet @...." or "@... said") and measure how closely they match the original. If accuracy is not so important, save time and just hash it. I had an idea for fast hash lookups too, so I'll put that below. :D
crunchyt
A: 

You are trying to hash a string, right? Built-in types can be hashed right away: just do hash("some string") and you get an int. It's the same function Python uses for dictionaries, so it is probably the best choice.
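For example (one caveat: since Python 3.3, string hashes are randomized per process unless PYTHONHASHSEED is fixed, so hash() values are not stable across runs and are a poor fit for values persisted in a database):

h = hash('There once was a man named Michael Finnegan.')
print(h)  # a platform-sized int; fine for in-memory dicts and sets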

THC4k
Doesn't that produce a 32-bit value, though? I think this application needs more collision resistance than that, since he's planning to discard the message and rely only on the hash. With 32-bit values you'd expect a collision within ~65K tweets (by the birthday paradox), which is like half an hour of Stephen Fry.
Steve Jessop
+2  A: 

There are a few issues here. First, RTs are not always identical. Some people add a comment. Others change the URL for tracking. Others add in the person they are RT'ing (who may or may not be the originator).

So if you are going to hash the tweet, you need to boil it down to the meat of the tweet, and only hash that. Good luck.

Above, someone mentioned that with 32 bits, you will start having collisions at about 65K tweets. Of course, you could have a collision on tweet #2. But I think the author of that comment was confused, since 2^16 = ~65K, but 2^32 = ~4 billion. So you have a little more room there.

A better algorithm might be to try to derive the "unique" parts of the tweet, and fingerprint it. It's not a hash, it's a fingerprint of a few key words that define uniqueness.
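A minimal sketch of that idea, with a hypothetical (abbreviated) stop-word list:

import hashlib
import re

STOP_WORDS = {'a', 'an', 'and', 'in', 'is', 'of', 'rt', 'the', 'to', 'via'}

def fingerprint(tweet):
    # Reduce the tweet to the sorted set of its distinctive words,
    # then hash that canonical form.
    words = re.findall(r'[a-z0-9]+', tweet.lower())
    keywords = sorted(set(words) - STOP_WORDS)
    return hashlib.md5(' '.join(keywords).encode('utf-8')).hexdigest()

# Two superficially different RTs yield the same fingerprint:
print(fingerprint('RT @bob: Shipping the new release today!'))
print(fingerprint('Shipping the new release today! (via @bob)'))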

Yes, I think dropping stop words and creating some sort of word frequency fingerprint is the way to go here.
Luke Francl