I am building a website in Python/Django and want to predict whether a user submission is valid or whether it is spam.
Users have an accept rate on their submissions, like this website has.
Users can moderate other users' submissions, and these moderations are later metamoderated by an admin.
Given this:
- Registered user A, with a submission accept rate of 60%, submits something.
- User B moderates A's post as a valid submission. However, user B is wrong 70% of the time.
- User C moderates A's post as spam. User C is usually right: if user C says something is spam or not spam, this will be correct 80% of the time.
How can I predict the chance of A's post being spam?
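One way to frame this is with Bayes' rule. The sketch below rests on assumptions that are not given above: A's accept rate is taken as the prior probability that the post is valid, each moderator's error rate is the same for spam and for valid posts, and B and C judge independently once the true status of the post is fixed. Here $S$ means "the post is spam" and $H$ means "the post is valid".

```latex
% Assumptions: P(S) = 1 - 0.6 = 0.4; B is correct with probability 0.3,
% C with probability 0.8; B and C are conditionally independent given S or H.
\begin{align*}
P(S \mid B{=}\text{valid},\, C{=}\text{spam})
  &= \frac{P(S)\,P(B{=}\text{valid} \mid S)\,P(C{=}\text{spam} \mid S)}
          {P(S)\,P(B{=}\text{valid} \mid S)\,P(C{=}\text{spam} \mid S)
           + P(H)\,P(B{=}\text{valid} \mid H)\,P(C{=}\text{spam} \mid H)} \\
  &= \frac{0.4 \cdot 0.7 \cdot 0.8}{0.4 \cdot 0.7 \cdot 0.8 + 0.6 \cdot 0.3 \cdot 0.2}
   = \frac{0.224}{0.260} \approx 0.86
\end{align*}
```

Under those assumptions B's vote actually counts as evidence for spam, because B is wrong more often than right.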
Edit: I made a Python script simulating this scenario:
#!/usr/bin/env python
import random

def submit(p):
    """Return 'ham' with (p*100)% probability"""
    return 'ham' if random.random() < p else 'spam'

def moderate(p, ham_or_spam):
    """Moderate ham as ham and spam as spam with (p*100)% probability"""
    if ham_or_spam == 'spam':
        return 'spam' if random.random() < p else 'ham'
    if ham_or_spam == 'ham':
        return 'ham' if random.random() < p else 'spam'

NUMBER_OF_SUBMISSIONS = 100000
USER_A_HAM_RATIO = 0.6 # Will submit 60% ham
USER_B_PRECISION = 0.3 # Will moderate a submission correctly 30% of the time
USER_C_PRECISION = 0.8 # Will moderate a submission correctly 80% of the time
user_a_submissions = [submit(USER_A_HAM_RATIO) \
                      for i in xrange(NUMBER_OF_SUBMISSIONS)]
print "User A has made %d submissions. %d of them are 'ham'." \
    % (len(user_a_submissions), user_a_submissions.count('ham'))
user_b_moderations = [moderate(USER_B_PRECISION, ham_or_spam) \
                      for ham_or_spam in user_a_submissions]
user_b_moderations_which_are_correct = \
    [i for i, j in zip(user_a_submissions, user_b_moderations) if i == j]
print "User B has correctly moderated %d submissions." % \
    len(user_b_moderations_which_are_correct)
user_c_moderations = [moderate(USER_C_PRECISION, ham_or_spam) \
                      for ham_or_spam in user_a_submissions]
user_c_moderations_which_are_correct = \
    [i for i, j in zip(user_a_submissions, user_c_moderations) if i == j]
print "User C has correctly moderated %d submissions." % \
    len(user_c_moderations_which_are_correct)
i = 0
j = 0
k = 0
for a, b, c in zip(user_a_submissions, user_b_moderations, user_c_moderations):
    if b == 'spam' and c == 'ham':
        i += 1
        if a == 'spam':
            j += 1
        elif a == "ham":
            k += 1
print "'spam' was identified as 'spam' by user B and 'ham' by user C %d times." % j
print "'ham' was identified as 'spam' by user B and 'ham' by user C %d times." % k
print "If user B says it's spam and user C says it's ham, it will be spam \
%.2f percent of the time, and ham %.2f percent of the time." % \
( float(j)/i*100, float(k)/i*100)
Running the script gives me this output:
- User A has made 100000 submissions. 60194 of them are 'ham'.
- User B has correctly moderated 29864 submissions.
- User C has correctly moderated 79990 submissions.
- 'spam' was identified as 'spam' by user B and 'ham' by user C 2346 times.
- 'ham' was identified as 'spam' by user B and 'ham' by user C 33634 times.
- If user B says it's spam and user C says it's ham, it will be spam 6.52 percent of the time, and ham 93.48 percent of the time.
Is the probability here reasonable? Would this be the correct way to simulate the scenario?
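For comparison, here is a minimal closed-form check of the case the script counts (B says spam, C says ham). It assumes the two moderators err independently given the true label, which is what the script does implicitly since every `moderate` call draws its own random number. The function name `posterior_spam` and its `votes` argument are made up for this sketch.

```python
def posterior_spam(prior_spam, votes):
    """Return P(spam | votes), assuming votes are independent given the true label.

    votes is a list of (verdict, accuracy) pairs, where verdict is 'spam' or
    'ham' and accuracy is the chance that the moderator is correct.
    """
    weight_spam = prior_spam
    weight_ham = 1.0 - prior_spam
    for verdict, accuracy in votes:
        # If the post is spam, a 'spam' verdict is a correct call; otherwise a miss.
        weight_spam *= accuracy if verdict == 'spam' else 1.0 - accuracy
        # If the post is ham, a 'ham' verdict is a correct call; otherwise a miss.
        weight_ham *= accuracy if verdict == 'ham' else 1.0 - accuracy
    return weight_spam / (weight_spam + weight_ham)

# Case counted by the script: A submits 40% spam, B (30% accurate) says spam,
# C (80% accurate) says ham.
simulated_case = posterior_spam(0.4, [('spam', 0.3), ('ham', 0.8)])
print("%.2f%%" % (simulated_case * 100))
```

This gives about 6.67%, which is close to the 6.52% from the simulation. Note that the script counts B saying spam and C saying ham, which is the reverse of the verdicts in the bullet list above; swapping the two verdicts in the `votes` list covers that case.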