I have a simple Python script that fetches tweets and caches them to disk. It is configured to run every two minutes via cron:

*/2 * * * * (date ; /usr/bin/python /path/get_tweets.py) >> /path/log/get_tweets.log 2>&1

The script runs successfully most of the time. Every so often, however, it doesn't execute. In addition to other logging, I added a simple print statement above the meat of the script, and on those occasions nothing except the output of the initial date command makes it to the log.

#!/usr/bin/python
# Script for Fetching Tweets and then storing them as an HTML snippet for inclusion using SSI

print "Starting get_tweets.py"

import simplejson as json
import urllib2
import httplib
import re
import calendar
import codecs
import os
import rfc822
from datetime import datetime
import time
import sys
import pprint


debug = True 

now = datetime.today()
template = u'<p class="tweet">%s <span class="date">on %s</span></p>'
html_snippet = u''
timelineUrl = 'http://api.twitter.com/1/statuses/user_timeline.json?screen_name=gcorne&count=7'
tweetFilePath = '/path/server-generated-includes/tweets.html'
if(debug): print "[%s] Fetching tweets from %s." % (now, timelineUrl)

def getTweets():
    request = urllib2.Request(timelineUrl)
    opener = urllib2.build_opener()
    try:
        tweets = opener.open(request)
    except:
        print "[%s] HTTP Request %s failed." % (now, timelineUrl)
        exitScript()
    tweets = tweets.read()
    return tweets

def exitScript():
    print "[%s] Script failed." % (now)
    sys.exit(0)


tweets = getTweets()
now = datetime.today()
if(debug): print "[%s] Tweets retrieved." % (now)
tweets = json.loads(tweets)

for tweet in tweets:
    text = tweet['text'] + ' '
    when = tweet['created_at']
    when = re.match(r'(\w+\s){3}', when).group(0).rstrip()
    # print GetRelativeCreatedAt(when)
    # convert links
    text = re.sub(r'(http://.*?)\s', r'<a href="\1">\1</a>', text).rstrip()
    #convert hashtags
    text = re.sub(r'#(\w+)', r'<a href="http://www.twitter.com/search/?q=%23\1">#\1</a>', text)
    # convert @ replies
    text = re.sub(r'@(\w+)', r'@<a href="http://www.twitter.com/\1">\1</a>', text)
    html_snippet += template % (text, when) + "\n"

#print html_snippet

now = datetime.today()
if(debug): print "[%s] Opening file %s." % (now, tweetFilePath)
try:
    file = codecs.open(tweetFilePath, 'w', 'utf_8')
except:
    print "[%s] File %s could not be opened." % (now, tweetFilePath)
    exitScript()

now = datetime.today()
if(debug): print "[%s] Writing %s to disk." % (now, tweetFilePath)
file.write(html_snippet)

now = datetime.today()
if(debug): print "[%s] Finished writing %s to disk." % (now, tweetFilePath)
file.close()
sys.exit(0)

Any ideas? The system is a VPS running CentOS 5.3 with Python 2.4.

Update: I have added the entire script to avoid any confusion.

+2  A: 

The most likely explanation is that once in a while the script takes more than two minutes (maybe the system is occasionally very busy, or the script has to wait for an external site that is occasionally busy, etc.) and your cron is a sensible one that skips repeating events whose previous run hasn't yet terminated. By logging the starting and ending times of your script, you'll be able to double-check whether that is the case. What you want to do in such circumstances is up to you. I recommend you consider skipping an occasional run to avoid further overloading a very busy system, whether your own or the remote one you're getting data from.
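One way to check this hypothesis, and to skip a run while the previous one is still active, is sketched below. It is only an illustration under stated assumptions: the helper names `log` and `run_exclusively` and the lock-file approach are not from the original script, and the code deliberately avoids newer syntax in the hope that it also suits the question's Python 2.4, though it has not been tested there.

```python
import fcntl
import sys
import time


def log(message):
    # Timestamp every line and flush immediately so it reaches the cron log
    # even if the process is killed or hangs later on.
    sys.stdout.write("[%s] %s\n" % (time.strftime("%Y-%m-%d %H:%M:%S"), message))
    sys.stdout.flush()


def run_exclusively(lock_path, job):
    """Run job() unless another process holds the lock; return True if it ran."""
    lock_file = open(lock_path, "w")
    try:
        try:
            # Non-blocking exclusive lock: fails at once if already held.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except IOError:
            log("previous run still active, skipping this one")
            return False
        started = time.time()
        log("run started")
        try:
            job()
        finally:
            # Logged even if job() raises, so every start has a matching end.
            log("run finished after %.1f s" % (time.time() - started))
        return True
    finally:
        # Closing the file also releases the lock.
        lock_file.close()
```

In the script from the question, the fetching and writing steps could be moved into a function (say, a hypothetical `main`) and started with `run_exclusively('/path/get_tweets.lock', main)`, where the lock path is likewise an assumption. Comparing the "run started" and "run finished" lines in the log then shows directly whether any run outlasts the two-minute interval.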

Alex Martelli
There is logging at various points throughout the Python script, but in the instances where the process spawned by crond hangs, none of the lines from the Python script are logged or executed.
gcorne
@gcorne, if you don't log the exact starting and ending time of the previous run, you can't check whether my hypothesis (that said ending time is simply after the otherwise-scheduled starting time of the next run) is correct.
Alex Martelli
I have added the entire script, which shows where and how I was logging the various actions it takes. In my analysis of the logs, the last event of each run (completion of the write) is logged well before the next cron start time. Nonetheless, I will see if your suggestion bears fruit.
gcorne