views:

201

answers:

5

A little bit of background: I'm building an inverted index for a search engine. I was originally using PHP, but because of the amount of times I needed to write to disk, I wanted to make a threaded indexer.

There's a problem with that because PHP is not thread safe. I then tried Java, but I ended up with at least 20 try catch blocks because of the JSON data structure I was using and working with files. The code was just too big and ugly.

Then I figured I should pick up some Python because it's flexible like PHP but also thread safe. Though I'm open to all criticism, what I'd like to learn is the shortcuts that the Python language/library provides that I skipped over. This is a PHP-afide Python script because all I really did was translate the PHP script line by line to what I thought was it's Python equivalent. Thanks.

http://pastebin.com/xrg7rf9w

+1  A: 

A really quick skim of the code makes me think you need to check out slicing in Python. http://stackoverflow.com/questions/509211/good-primer-for-python-slice-notation should tell you a lot of what you can do. In your code for instance, calls to get_file_name can be directly replaced with the slice token[lower:upper], and using negative indices removes the need to call len(token).

Nikwin
+8  A: 
  • Your stated goal is to build a threaded indexer. My most global critique is of that goal—threads are confusing and hard (even though they seem easy) and seldom buy you something worth it, especially in a Python app.

    If you want asynchronicity or concurrency, threads are not the only way to implement it. In my experience and seeing other Python code people have written, it's not the best way to accomplish these. If you are trying to run lots of asynchronous network requests, consider using Twisted. This has a strong concurrency model and several ways to help your code scale.

  • You made dictionary, categories, users, tweets and other things class attributes, meaning that they are created once for the class instead of for each instance. This means that when you mutate them, the change will take place for all instances of Indexer you have made rather than just the current instance of Indexer. If this was implemented to share state between threads, that is dangerous—it's very easy to do non-atomic operations and not realise it.

  • String formatting is often nicer to work with than repeated concatination. '<a href = "/index.php?query=' + user_name + '&type=u">' + display_user_name + '</a>' ==> '<a href = "/index.php?query=%s&type=u">%s</a>' % (user_name, display_user_name). For nontrivial formations like this, Python has some great templating engines that are widely used.

  • Python slices take negative indices to indicate counting from the end. I mean foo[1:len(foo) - 1] ==> foo[1:-1].

  • A lot of your attributes and parameters are named after a type like dictionary, tweet_array, array, or hash; it's usually nicer to have descriptive names which actually say what something is, not what it is. Most of these aren't even proper Python terms (array is a thing in Python, and not the one you're using; hash is a built-in function you're shadowing.)

  • You call close to close your files. This means you can't ensure they get closed if there is an exception. In 2.5+ I always use "with open(filename, mode) as f: if possible to ensure that my files get closed and in earlier Pythons I use try/finally if I want to be sure my file is closed.

  • The os.path module gives a robust, platform-agnosticish interface for manipulating paths.

  • sorted(existing_tweets[token]) does not do anything when you call it. To sort a list in place, use it's sort method (foo.sort()). sorted returns a new list which you do not do anything with (like existing_tweets[token] = sorted(existing_tweets[token]).)

  • Using os.path.exists usually introduces unnecessary race conditions. Consider just always opening the file to make sure it exists.

  • == True and == False are always silly and sometimes don't give the result wanted. Use if foo: and if not foo.

I'm sorry more of my critiques were not higher-level or more earth-shattering. I hope you are enjoying Python so far and being productive and efficient. It looks like you aren't having too much trouble expressing yourself.

Mike Graham
Yes I am aware that I've done nothing to make it threaded but I have not yet implemented any of the thread support (aside from the imports and run).The idea is to process 1000 tweets, write them to file, which is potentially require hundreds of file handler requests, and while it hangs on the IO the next thread will process another 1000 tweets.
tipu
Threads are not the only model for non-blocking IO. I don't really think they're the best, either. Check out the link to Twisted to encounter another that will let you scale more arbitrarily and safely and predictably than threads.
Mike Graham
+1  A: 

Three fast comments:

  1. You're shadowing the python token module in the for token in tokens loop / this is generally considered bad practice (although not harmful in this specific case). Note how you can see this indicated by the different syntax coloring in the pastebin.

  2. Use os.path.join to combine paths (instead of concatenating them with +)

  3. When writing to a file, use the with statement to make sure the file get closed automagically if an error occurs during the write with open('myfilename','w') as filehandle:

ChristopheD
I'll be checking out the token module.
tipu
A: 

Here's one thing I learned recently

http://pastebin.com/Mnmfs7zN

def main():
    i = Indexer()
    start = time.time()
    i.index()
    end = time.time()
    elapsed= end - start
    print (elapsed)            



if __name__ == "__main__":                     
    main()

Add a main function. "floating" code is executed when the module is loaded, by putting it in a function it will be executed only then the script is run.

Additionally, code in functions may be optimized ( there's not much difference here although )

OscarRyz
That's normally used for unit testing as well.
kprobst
Neat, will use, thanks.
tipu
A: 

Mike Graham's answer points out many of the issues, but there are a couple others:

Technically you haven't threaded anything. You imported the threading library but never actually used it. I think you meant to do the following (extend the Thread class):

class Indexer (threading.Thread):
# ... snip ...
i = Indexer()
i.start()  #instead of directly calling i.index()

Also, as Mike said, threading is not that straightforward. You will need some way to lock your input and output so that multiple threads do not attempt to access them at the same time.

Along the same lines, due to the Global Interpreter Lock (GIL), threading in python will only speed up your application if it is I/O bound, not if it's processor bound. You may want to consider using the multiprocessing module instead.

tgray
I understand nothing is yet threaded, I just completed the single threaded version and wanted to tidy it up.
tipu