Scenario:
I have a contact form on my web app, and it gets a lot of spam.
I am validating the format of email addresses loosely, i.e. ^.+@.+\..+$
I am using a spam filtering service (Defensio), but the spam scores it returns overlap with those of valid messages. At a threshold of 0.4, some spam gets through, while some customers' questions are wrongly logged as spam and the customer is shown an error.

All of the spam messages use fake email addresses e.g. [email protected]

Setup: dedicated PHP 5 Linux server in the US, MySQL. Only spam is logged; non-spam messages are emailed (not stored).

Proposal: Use PHP's checkdnsrr(preg_replace('/^.+?@/', '', $_POST['email']), 'MX') to check that the email domain has an MX record. Messages whose domain does not resolve are logged to a file and the user is redirected with an error; messages whose domain does resolve according to checkdnsrr() proceed to the spam filter service as before.

I have read (and I am sceptical about this myself) that you should never leave this type of validation up to remote lookups, but why?

Aside from connectivity issues, in which case I will have bigger problems than a contact form anyway, is checkdnsrr() going to produce false positives or negatives?
Would some address types fail to resolve, such as .gov addresses or addresses that use an IP address instead of a domain name?
Do I need to escape the hostname I pass to checkdnsrr()?

Solution: A combination of all three answers (I wish I could accept more than one as a compound answer).

I am using:

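// Strip everything up to the first "@" and append a trailing dot so the name is treated as fully qualified.
$email_domain = preg_replace('/^.+?@/', '', $email) . '.';
// Accept the domain if it has an MX record or, failing that, an A record.
if (!checkdnsrr($email_domain, 'MX') && !checkdnsrr($email_domain, 'A')) {
    // validation error
}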

All spam is being logged and the logs rotated, with a view to upgrading to a job queue at a later date.

Some comments suggested asking the mail server to verify the user. I felt this would generate too much traffic and might get my server banned or into trouble in some way. The goal here is only to cut out most of the emails that were bouncing because of invalid server addresses.

References: http://en.wikipedia.org/wiki/Fqdn and this passage from RFC 2821:

The lookup first attempts to locate an MX record associated with the name.
If a CNAME record is found instead, the resulting name is processed as if 
it were the initial name.
If no MX records are found, but an A RR is found, the A RR is treated as
if it was associated with an implicit MX RR, with a preference of 0,
pointing to that host.  If one or more MX RRs are found for a given
name, SMTP systems MUST NOT utilize any A RRs associated with that
name unless they are located using the MX RRs; the "implicit MX" rule
above applies only if there are no MX records present.  If MX records
are present, but none of them are usable, this situation MUST be
reported as an error.
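
For anyone who wants to try the MX-then-A fallback without hitting real DNS, here is a minimal sketch of the same rule wrapped in a function. The function name and the optional $lookup parameter are my own inventions for illustration (they are not part of checkdnsrr() or of the snippet above), and it takes the part after the last "@" rather than the first. It is written for PHP 7.1+ rather than the PHP 5 mentioned in the question.

```php
<?php
/**
 * Returns true if the domain part of $email has an MX record or,
 * failing that, an A record (the "implicit MX" rule from the RFC quote).
 *
 * $lookup is a hypothetical injection point: it defaults to PHP's
 * checkdnsrr() and can be replaced by a stub for offline testing.
 */
function email_domain_resolves(string $email, ?callable $lookup = null): bool
{
    $lookup = $lookup ?? 'checkdnsrr';

    // No "@" at all means there is no domain to look up.
    $at = strrpos($email, '@');
    if ($at === false) {
        return false;
    }

    // Everything after the last "@", minus any trailing dots.
    $host = rtrim((string) substr($email, $at + 1), '.');
    if ($host === '') {
        return false;
    }

    // A trailing dot marks the name as fully qualified, so the resolver
    // does not append a local search domain.
    $domain = $host . '.';

    // MX first; the A lookup only runs when no MX record was found.
    return $lookup($domain, 'MX') || $lookup($domain, 'A');
}
```

In the form handler this would be called as email_domain_resolves($_POST['email']), leaving the second argument out so the real checkdnsrr() is used.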

Many thanks to all (especially ZoogieZork for the A record fallback tip)

+1  A: 

I see no harm in doing an MX lookup with checkdnsrr(), and I don't see how false positives could appear. You don't need to escape the hostname. In fact, you can take this technique a little further by talking to the MTA and testing whether the user exists at the given host (although this may, and probably will, give you false positives on some hosts).

Alix Axel
Most SMTP hosts you can find in the wild will not respond well to VRFY commands (both always OK as well as always ERROR are responses you can expect). Using VRFY for validating addresses is highly discouraged.
Guss
+2  A: 

DNS lookups can be slow at times, depending on network traffic & congestion, so that's something to be aware of.

If I were in your shoes, I'd test it out and see how it goes. For a week or so, log all emails to a database or log file and include a field to indicate if it would be marked as spam or legitimate email. After the week is over, take a look at the results and see if it's performing as you would expect.

Taking this logging/testing approach gives you the flexibility to test it out and not worry about losing customer emails.
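
A rough sketch of what one line of such a trial log could look like. The column layout and the function name are illustrative assumptions, not something from a library; the idea is only that the message is still delivered as usual while the line records what the DNS check would have decided.

```php
<?php
/**
 * Builds one tab-separated log line for the trial period.
 * Columns (an assumed layout): UTC time, email, would-be verdict, spam score.
 */
function trial_log_line(string $email, bool $domainResolves, float $spamScore, int $timestamp): string
{
    $verdict = $domainResolves ? 'would-accept' : 'would-reject';

    // Replace tabs and newlines so a hostile address cannot forge extra
    // columns or lines in the log.
    $safeEmail = preg_replace('/[\t\r\n]+/', ' ', $email);

    return implode("\t", [
        gmdate('Y-m-d\TH:i:s\Z', $timestamp),
        $safeEmail,
        $verdict,
        number_format($spamScore, 2, '.', ''),
    ]) . "\n";
}
```

The line could then be appended with file_put_contents($path, $line, FILE_APPEND | LOCK_EX), where $path is whatever log file you choose.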

I've gotten into the habit of adding an extra field to my forms that is hidden with CSS. If it's filled in, I assume the form is being submitted by a spam bot. I also make sure to use a name like "url" or "website_url", something that looks like a legitimate field name to a spam bot. Add a label for the field that says something like "Don't fill out this field", so that if someone's browser doesn't render it correctly, they will know not to fill out the spam field. So far it's working very well for me.
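
On the server side, the check can be as small as the sketch below. The field name "url" follows the suggestion above; the function name is made up for this example, and treating a missing field as "not tripped" is a judgment call.

```php
<?php
/**
 * Returns true when the hidden honeypot field came back non-empty,
 * which suggests the form was filled in by a bot.
 */
function honeypot_tripped(array $post, string $field = 'url'): bool
{
    // A real browser submits the hidden field as an empty string.
    // A missing field is treated as empty here (an assumption).
    $value = $post[$field] ?? '';

    // The real form never sends an array for this field, so anything
    // that is not a string is treated as suspicious.
    if (!is_string($value)) {
        return true;
    }

    return trim($value) !== '';
}
```

In the form handler: if (honeypot_tripped($_POST)) { /* log and stop */ }.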

bradym
Re: hidden field - Good idea! As for the logging, make sure you also log the time it took to resolve the DNS record. You may find that it takes too long and results in a poor user experience.
Guss
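
A minimal sketch of timing the lookup so the duration can go into the log. The function name and the optional $lookup parameter are hypothetical conveniences for testing; hrtime() needs PHP 7.3 or later, so on an older server microtime(true) would have to stand in for it.

```php
<?php
/**
 * Runs a DNS lookup and reports how long it took.
 * Returns [bool $found, float $seconds].
 *
 * $lookup defaults to checkdnsrr(); passing a stub is only a testing aid.
 */
function timed_dns_lookup(string $host, string $type = 'MX', ?callable $lookup = null): array
{
    $lookup = $lookup ?? 'checkdnsrr';

    // hrtime(true) is a monotonic clock in nanoseconds.
    $start = hrtime(true);
    $found = (bool) $lookup($host, $type);
    $seconds = (hrtime(true) - $start) / 1e9;

    return [$found, $seconds];
}
```

Usage would be along the lines of list($found, $seconds) = timed_dns_lookup('example.com.'); followed by writing $seconds to the log next to the result.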
I'm testing out the hidden field now. It seems to work OK, though some users are typing "Not sure what to put in this field".
Question Mark
If users are typing anything in the field, it's not being properly hidden. There may be a bug in your CSS that's not hiding the field properly. I usually do something like this:

<span style="display:none;visibility:hidden;">
  <label for="url">
    Ignore this text box. It is used to detect spammers. If you enter anything into this text box, your message will not be sent.
  </label>
  <input type="text" id="url" name="url" size="1" value="" />
</span>

I haven't seen any spam coming to the forms where I've implemented this for quite some time.
bradym
A: 

An MX lookup is only part of the picture. If you want to ensure that the email address itself is valid, you need to attempt to send an email to that account.

The other possible scenario is that someone is simply using hijacked email accounts from a compromised machine. That is probably a little less likely, but it still happens.

There are email address validation libraries out there that do this; simply search for "email validation".

All of this can be done asynchronously. I have this set up on my site: the email is saved in the database (for auditing purposes) and a job is queued. When the job comes to execute, any additional validation is performed at that point. This offloads the heavy lifting to another thread.

To the user, it appears as if the email was sent already. In a sense it was (it's in the database and can be viewed internally), but the actual email won't get mailed out until that job executes, which can be immediately or after some delay depending on the server load.
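
To make the flow concrete, here is a rough in-memory stand-in for "save, queue, validate later". It is not the setup described above: a real one would use a database table plus a cron job or worker process, and every class and method name here is hypothetical.

```php
<?php
/**
 * In-memory sketch of a contact-form job queue.
 * submit() is what the form handler calls; process() is what a
 * background worker would call later.
 */
final class ContactQueue
{
    /** @var array<int, array{email: string, body: string, status: string}> */
    private $messages = [];

    /** @var SplQueue IDs of messages still waiting for validation */
    private $pending;

    public function __construct()
    {
        $this->pending = new SplQueue();
    }

    /** Store the message and return at once; the user sees "sent". */
    public function submit(string $email, string $body): int
    {
        $id = count($this->messages) + 1;
        $this->messages[$id] = ['email' => $email, 'body' => $body, 'status' => 'pending'];
        $this->pending->enqueue($id);
        return $id;
    }

    /** Run the slow validation on each queued message and mark the result. */
    public function process(callable $isValid): void
    {
        while (!$this->pending->isEmpty()) {
            $id = $this->pending->dequeue();
            $ok = (bool) $isValid($this->messages[$id]['email']);
            // "sent" marks the point where the real mail would go out.
            $this->messages[$id]['status'] = $ok ? 'sent' : 'rejected';
        }
    }

    public function status(int $id): string
    {
        return $this->messages[$id]['status'];
    }
}
```

The callable passed to process() is where a DNS check or any other slow validation would go.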

Walter

I like the idea of a validation job queue
Question Mark
It is a job queue; part of that job is to do validation. The problem with this model is that someone can enter an email address thinking it is valid and the message sent, and then the system rejects it when it is processed later.