views:

342

answers:

7

I am a PHP developer exploring the outside world. I have decided to start learning Python.

The below script is my first attempt at porting a PHP script to Python. Its job is to take tweets from a Redis store. The tweets are coming from Twitter's Streaming API and stored as JSON objects. Then the information needed is extracted and dumped into a CSV file to be imported into MySQL using the LOAD DATA LOCAL INFILE that is hosted on a different server.

So, the question is: Now that I have my first Python script running, how can I make it more Pythonic? Are there any suggestions that you guys have? Make it better? Tricks I should know about? Constructive Criticism?

Update: Having taken everyone's suggestions thus far, here is the updated version:
Update2: Ran the code through pylint. Now scores a 9.89/10. Any other suggestions?

# -*- coding: utf-8 -*-
"""Redis IO Loop for Tweelay Bot"""
from __future__ import with_statement

import simplejson
import re
import datetime
import time
import csv
import hashlib

# Bot Modules
import tweelay.red as red
import tweelay.upload as upload
import tweelay.openanything as openanything

__version__ = "4"

def process_tweets():
  """Processes 0-20 tweets from Redis store"""
  data = []
  last_id = 0
  for i in range(20):
    last = red.pop_tweet()
    if not last:
      break

    t = TweetHandler(last)
    t.cleanup()
    t.extract()

    if t.get_tweet_id() == last_id:
      break

    tweet = t.proc()
    if tweet:
      data = data + [tweet]
      last_id = t.get_tweet_id()

    time.sleep(0.01)

  if not data:
    return False

  ch = CSVHandler(data)
  ch.pack_csv()
  ch.uploadr()

  source = "http://bot.tweelay.net/tweets.php"
  openanything.openAnything(
    source,
    etag=None,
    lastmodified=None,
    agent="Tweelay/%s (Redis)" % __version__
    )

class TweetHandler:
  """Cleans, Builds and returns needed data from Tweet"""
  def __init__(self, json):
    self.json = json
    self.tweet = None
    self.tweet_id = 0
    self.j = None

  def cleanup(self):
    """Takes JSON encoded tweet and cleans it up for processing"""
    self.tweet = unicode(self.json, "utf-8")
    self.tweet = re.sub('^s:[0-9]+:["]+', '', self.tweet)
    self.tweet = re.sub('\n["]+;$', '', self.tweet)

  def extract(self):
    """Takes cleaned up JSON encoded tweet and extracts the datas we need"""
    self.j = simplejson.loads(self.tweet)

  def proc(self):
    """Builds the datas from the JSON object"""
    try:
      return self.build()
    except KeyError:
      if 'delete' in self.j:
        return None
      else:
        print ";".join(["%s=%s" % (k, v) for k, v in self.j.items()])
        return None

  def build(self):
    """Builds tuple from JSON tweet"""
    return (
    self.j['user']['id'],
    self.j['user']['screen_name'].encode('utf-8'),
    self.j['text'].encode('utf-8'),
    self.j['id'],
    self.j['in_reply_to_status_id'],
    self.j['in_reply_to_user_id'],
    self.j['created_at'],
    __version__ )

  def get_tweet_id(self):
    """Return Tweet ID"""
    if 'id' in self.j:
      return self.j['id']

    if 'delete' in self.j:
      return self.j['delete']['status']['id']


class CSVHandler:
  """Takes list of tweets and saves them to a CSV
     file to be inserted into MySQL data store"""
  def __init__(self, data):
    self.data = data
    self.file_name = self.gen_file_name()

  def gen_file_name(self):
    """Generate unique file name"""
    now = datetime.datetime.now()

    hashr = hashlib.sha1()
    hashr.update(str(now))
    hashr.update(str(len(self.data)))

    hash_str = hashr.hexdigest()
    return hash_str+'.csv'

  def pack_csv(self):
    """Save tweet data to CSV file"""
    with open('tmp/'+self.file_name, mode='ab') as ofile:
      writer = csv.writer(
        ofile, delimiter=',',
        quotechar='"',
        quoting=csv.QUOTE_MINIMAL)
      writer.writerows(self.data)

  def uploadr(self):
    """Upload file to remote host"""
    url = "http://example.com/up.php?filename="+self.file_name
    uploadr = upload.upload_file(url, 'tmp/'+self.file_name)
    if uploadr[0] == 200:
      print "Upload: 200 - ("+str(len(self.data))+")", self.file_name
      print "-------"
      #os.remove('tmp/'+self.file_name)
    else:
      print "Upload Error:", uploadr[0]

if __name__ == "__main__":
  while True:
    process_tweets()
    time.sleep(1)
+1  A: 
  1. Every method variable name I've ever seen in Python was lowercase with no underscores. (I don't think this is a requirement and may not be standard practice.)
  2. You should really break up the logic into multiple, single-purpose methods.
  3. Take 2 a step further and create some classes to encapsulate related methods together.
Ryan Riley
What? Underscores are used all the time in Python. PEP 8 even explicitly says "Function names should be lowercase, with words separated by underscores as necessary to improve readability." Heck, just off the top of my head, since I was using it last night look at the threading module: http://docs.python.org/library/threading.html
Nicholas Knight
I have updated the code to include your suggestions about breaking down the code into methods and classes. I have left the underscores there as I find them easier to read.
Jayrox
@Nicholas Doh! You are correct. I was thinking of variable names. I'm still learning Python, myself. Sorry for the goof.
Ryan Riley
@ryan, looks like this is a good experience for the both of us :)
Jayrox
+5  A: 

Pythonic python does not use integer flow control very much. The idiom is almost always for item in container:. Also, I would use a class to hold a 'User object'. It will be a lot easier to use than simple container types likes lists and dictionaries (And arrange your code into to a more OO style.) You can compile reg-exes before hand for a little more performance.

class MyTweet(object):
  def __init__(self, data):
    # ...process json here
    # ...
    self.user = user

for data in getTweets():
  tweet = MyTweet(data)
nate c
Agreed on the integer flow control, and even when you think you need integer control (indexing into two lists) it's better to map or zip them them together before iterating through. This will raise an error somewhere that makes more sense and make your code more readable.
marr75
+2  A: 
# Bot Modules
import red #Simple Redis API functions
import upload #pycurl script to upload to remote server

If your app is going to be used and maintained, its better to pack all these modules in the package.

Daniel Kluev
what do you mean? do you mean put their functions into the same file? as the rest of the code?
Jayrox
No, I mean putting them in their own namespace, like `myapp.red`, `myapp.upload`, etc.Basically you just need to make directory `myapp`, put `__init__.py` there (it can be empty) and move your files there. Then you could import it as `import myapp.red`, or, in more common form, `from myapp.red import Class1, Class2, Class3, func1`. Moreover, you could also add setup.py later, and `install` the package. See http://docs.python.org/distutils/examples.html#pure-python-distribution-by-package
Daniel Kluev
Interesting, what if I don't currently plan on distributing the code? Should I still follow this pattern?
Jayrox
>Should I still follow this pattern? -- most likely. For example, if you define own `logging` module to do some logging, someone reading your code (or even you yourself half a year later) will see `import logging` and believe that you import `logging` package from the stdlib, while in fact you import local module. Using package's own namespace makes it clear whether its local module or outside one.
Daniel Kluev
ahh gotcha, good point.
Jayrox
ok made a folder with an empty file named `__init__.py`; moved the modules into that folder and imported the files using `import tweelay.red as red` sound right?
Jayrox
> import tweelay.red as red -- Yes, this is okay too.
Daniel Kluev
+18  A: 

Instead of:

  i=0
  end=20
  last_id=0
  data=[]
  while(i<=end):
    i = i + 1
    ...

code:

  last_id=0
  data=[]
  for i in xrange(1, 22):
    ...

Same semantics, more compact and Pythonic.

Instead of

if not last or last == None:

do just

if not last:

since None is false-ish anyway (so not last is True when last is None). In general, when you want to check if something isNone, codeis None, not== None`.

In

  if(j['id'] <> last_id):

lose the redundant parentheses and the obsolete <> operator and code instead

  if j['id'] != last_id:

and also remove the redundant parentheses from other if statements.

Instead of:

  if len(data) == 0:

code:

  if not data:

since any empty container is false-ish.

In

hash_str = str(hash.hexdigest())

code instead

hash_str = hash.hexdigest()

since the method already returns a string, making the str call redundant.

Instead of:

  for item in data:
    writer.writerow(item)

use

  writer.writerows(data)

which does the loop on your behalf.

Instead of

  ofile = open('tmp/'+file_name, mode='ab')
  ...
  ofile.close()    

use (in Python 2.6 or better, or in 2.5 by starting the module with

  from __future__ import with_statement

to "import from the future" the with statement feature):

  with open('tmp/'+file_name, mode='ab') as ofile:
    ...

which guarantees to do the close for you (including in cases where an exception might be raised).

Instead of

print "Upload Error: "+uploadr[0]

use

print "Upload Error:", uploadr[0]

and similarly for other print statements -- the comma inserts a space for you.

I'm sure there are more such little things, but these are a few that "jumped to the eye" as I was scanning your code.

Alex Martelli
awesome, thanks :)
Jayrox
+2  A: 

Instead of ....

  i=0
  end=20
  last_id=0
  data=[]
  while(i<=end):
    i = i + 1

you can use...

for i in range(20):

but overall, it's not very clear where this 20 comes from?? magic #?

fseto
20 is just a number I chose for a control limiter. not much more really.
Jayrox
+2  A: 

If you have a method that won't fit in the view pane you really want to shorten it. Say 15 lines or so. I see what looks like at least 3 methods: print_tweet, save_csv, and upload_data. It's a bit hard to say exactly what they should be named but there do seem to be three distinct sections of code that you should try to break out.

Paul Rubel
Yea, I was reading that in PEP8, but I'm not sure how else to break them down.
Jayrox
@jayrox: then read again what paulrubel wrote, he gave you the functional breakdown right here.
msw
@mswn, sorry wasnt clear. i was talking about the 80 chars per line part of breaking it down.
Jayrox
+2  A: 

Run your code through pylint.

jedi_coder
Awesome! Thanks, I wasn't aware of pylint :)
Jayrox