views:

1404

answers:

7

Right now I have a log parser reading through 515mb of plain-text files (a file for each day over the past 4 years). My code currently stands as this: http://gist.github.com/12978. I've used psyco (as seen in the code) and I'm also compiling it and using the compiled version. It's doing about 100 lines every 0.3 seconds. The machine is a standard 15" MacBook Pro (2.4ghz C2D, 2GB RAM)

Is it possible for this to go faster or is that a limitation on the language/database?

A: 

I typed python -m cProfile parser.py and got this back:

    1    0.000    0.000    0.000    0.000 <string>:1(<module>)
    1    0.000    0.000    0.000    0.000 __future__.py:48(<module>)
    1    0.000    0.000    0.000    0.000 __future__.py:70(_Feature)
    5    0.000    0.000    0.000    0.000 __future__.py:71(__init__)
    1    0.002    0.002    0.006    0.006 __init__.py:16(<module>)
    1    0.032    0.032    0.050    0.050 __init__.py:18(<module>)
    2    0.000    0.000    0.000    0.000 __init__.py:70(search_function)
    1    0.020    0.020    0.020    0.020 _strptime.py:12(<module>)
    1    0.000    0.000    0.000    0.000 _strptime.py:179(TimeRE)
    1    0.000    0.000    0.000    0.000 _strptime.py:182(__init__)
    5    0.000    0.000    0.000    0.000 _strptime.py:214(<genexpr>)
   50    0.000    0.000    0.000    0.000 _strptime.py:238(<genexpr>)
  455    0.001    0.000    0.001    0.000 _strptime.py:297(strptime)
    1    0.000    0.000    0.000    0.000 _strptime.py:34(LocaleTime)
    1    0.000    0.000    0.000    0.000 atexit.py:37(register)
    1    0.000    0.000    0.000    0.000 atexit.py:6(<module>)
    1    0.000    0.000    0.000    0.000 calendar.py:125(Calendar)
    1    0.000    0.000    0.000    0.000 calendar.py:131(__init__)
    1    0.000    0.000    0.000    0.000 calendar.py:137(setfirstweekday)
    1    0.000    0.000    0.000    0.000 calendar.py:20(IllegalMonthError)
    1    0.000    0.000    0.000    0.000 calendar.py:254(TextCalendar)
    1    0.000    0.000    0.000    0.000 calendar.py:27(IllegalWeekdayError)
    1    0.000    0.000    0.000    0.000 calendar.py:371(HTMLCalendar)
    1    0.000    0.000    0.000    0.000 calendar.py:46(_localized_month)
    1    0.000    0.000    0.000    0.000 calendar.py:483(TimeEncoding)
    1    0.000    0.000    0.000    0.000 calendar.py:495(LocaleTextCalendar)
    2    0.000    0.000    0.000    0.000 calendar.py:51(__init__)
    1    0.000    0.000    0.000    0.000 calendar.py:530(LocaleHTMLCalendar)
   26    0.000    0.000    0.000    0.000 calendar.py:54(__getitem__)
    1    0.000    0.000    0.000    0.000 calendar.py:6(<module>)
    1    0.000    0.000    0.000    0.000 calendar.py:65(_localized_day)
    2    0.000    0.000    0.000    0.000 calendar.py:70(__init__)
   14    0.000    0.000    0.000    0.000 calendar.py:73(__getitem__)
    2    0.000    0.000    0.000    0.000 codecs.py:77(__new__)
   14    0.000    0.000    0.000    0.000 copy.py:100(_copy_immutable)
    1    0.000    0.000    0.000    0.000 copy.py:112(_copy_with_constructor)
   15    0.000    0.000    0.000    0.000 copy.py:65(copy)
    1    0.001    0.001    0.001    0.001 core.py:18(<module>)
    1    0.000    0.000    0.001    0.001 core.py:34(full)
    1    0.009    0.009    0.018    0.018 decimal.py:116(<module>)
    1    0.000    0.000    0.000    0.000 decimal.py:154(DecimalException)
    1    0.000    0.000    0.000    0.000 decimal.py:177(Clamped)
    1    0.000    0.000    0.000    0.000 decimal.py:190(InvalidOperation)
    1    0.000    0.000    0.000    0.000 decimal.py:214(ConversionSyntax)
    1    0.000    0.000    0.000    0.000 decimal.py:2219(_ContextManager)
    1    0.000    0.000    0.000    0.000 decimal.py:2234(Context)
    1    0.000    0.000    0.000    0.000 decimal.py:225(DivisionByZero)
    4    0.000    0.000    0.000    0.000 decimal.py:2254(__init__)
    1    0.000    0.000    0.000    0.000 decimal.py:243(DivisionImpossible)
    1    0.000    0.000    0.000    0.000 decimal.py:254(DivisionUndefined)
    1    0.000    0.000    0.000    0.000 decimal.py:267(Inexact)
    1    0.000    0.000    0.000    0.000 decimal.py:280(InvalidContext)
    1    0.000    0.000    0.000    0.000 decimal.py:2869(_WorkRep)
    1    0.000    0.000    0.000    0.000 decimal.py:294(Rounded)
    5    0.000    0.000    0.000    0.000 decimal.py:2983(_isinfinity)
    2    0.000    0.000    0.000    0.000 decimal.py:2991(_isnan)
    1    0.000    0.000    0.000    0.000 decimal.py:307(Subnormal)
    1    0.000    0.000    0.000    0.000 decimal.py:319(Overflow)
    1    0.000    0.000    0.000    0.000 decimal.py:357(Underflow)
    3    0.000    0.000    0.000    0.000 decimal.py:438(getcontext)
    1    0.000    0.000    0.000    0.000 decimal.py:507(Decimal)
    3    0.000    0.000    0.000    0.000 decimal.py:516(__new__)
    1    0.000    0.000    0.000    0.000 kdictproxy.py:108(compactdictproxy)
    1    0.001    0.001    0.001    0.001 kdictproxy.py:5(<module>)
    1    0.000    0.000    0.000    0.000 latin_1.py:13(Codec)
    1    0.000    0.000    0.000    0.000 latin_1.py:20(IncrementalEncoder)
    1    0.000    0.000    0.000    0.000 latin_1.py:24(IncrementalDecoder)
    1    0.000    0.000    0.000    0.000 latin_1.py:28(StreamWriter)
    1    0.000    0.000    0.000    0.000 latin_1.py:31(StreamReader)
    1    0.000    0.000    0.000    0.000 latin_1.py:34(StreamConverter)
    1    0.000    0.000    0.000    0.000 latin_1.py:8(<module>)
    1    0.000    0.000    0.000    0.000 new.py:5(<module>)
  149    0.000    0.000    0.000    0.000 parser.py:26(log_hostname)
    1    0.712    0.712    1.731    1.731 parser.py:4(<module>)
    5    0.000    0.000    0.000    0.000 parser.py:40(log_date)
  445    0.001    0.000    0.001    0.000 parser.py:46(log_person)
  445    0.000    0.000    0.000    0.000 parser.py:57(time_to_string)
  301    0.000    0.000    0.000    0.000 parser.py:60(sanitize)
    5    0.000    0.000    0.000    0.000 posixpath.py:192(isdir)
    5    0.000    0.000    0.000    0.000 posixpath.py:56(join)
    1    0.001    0.001    0.001    0.001 profiler.py:10(<module>)
    1    0.000    0.000    0.000    0.000 profiler.py:138(Profiler)
    1    0.000    0.000    0.000    0.000 profiler.py:141(run)
    1    0.000    0.000    0.000    0.000 profiler.py:153(start)
    1    0.000    0.000    0.000    0.000 profiler.py:237(FullCompiler)
    1    0.000    0.000    0.000    0.000 profiler.py:239(do_start)
    1    0.000    0.000    0.000    0.000 profiler.py:246(RunOnly)
    1    0.000    0.000    0.000    0.000 profiler.py:255(ChargeProfiler)
    1    0.000    0.000    0.000    0.000 profiler.py:271(ActiveProfiler)
    1    0.000    0.000    0.000    0.000 profiler.py:285(PassiveProfiler)
    1    0.000    0.000    0.000    0.000 profiler.py:329(ActivePassiveProfiler)
    1    0.000    0.000    0.000    0.000 profiler.py:35(go)
  149    0.000    0.000    0.000    0.000 re.py:126(match)
  905    0.001    0.000    0.001    0.000 re.py:136(sub)
    1    0.000    0.000    0.002    0.002 re.py:178(compile)
   44    0.000    0.000    0.000    0.000 re.py:196(escape)
    1    0.000    0.000    0.002    0.002 re.py:219(_compile)
   10    0.000    0.000    0.000    0.000 re.py:259(_subx)
    7    0.000    0.000    0.000    0.000 sre_compile.py:184(_compile_charset)
    7    0.000    0.000    0.000    0.000 sre_compile.py:213(_optimize_charset)
   12    0.000    0.000    0.000    0.000 sre_compile.py:24(_identityfunction)
    8    0.000    0.000    0.000    0.000 sre_compile.py:360(_simple)
    1    0.000    0.000    0.000    0.000 sre_compile.py:367(_compile_info)
 19/1    0.000    0.000    0.001    0.001 sre_compile.py:38(_compile)
    2    0.000    0.000    0.000    0.000 sre_compile.py:480(isstring)
    1    0.000    0.000    0.001    0.001 sre_compile.py:486(_code)
    1    0.000    0.000    0.002    0.002 sre_compile.py:501(compile)
   56    0.000    0.000    0.000    0.000 sre_compile.py:57(fixup)
   67    0.000    0.000    0.000    0.000 sre_parse.py:132(__len__)
  144    0.000    0.000    0.000    0.000 sre_parse.py:136(__getitem__)
   10    0.000    0.000    0.000    0.000 sre_parse.py:138(__setitem__)
   10    0.000    0.000    0.000    0.000 sre_parse.py:140(__getslice__)
   19    0.000    0.000    0.000    0.000 sre_parse.py:144(append)
 27/9    0.000    0.000    0.000    0.000 sre_parse.py:146(getwidth)
   12    0.000    0.000    0.000    0.000 sre_parse.py:184(__init__)
  158    0.000    0.000    0.000    0.000 sre_parse.py:188(__next)
   61    0.000    0.000    0.000    0.000 sre_parse.py:201(match)
  133    0.000    0.000    0.000    0.000 sre_parse.py:207(get)
   22    0.000    0.000    0.000    0.000 sre_parse.py:216(isident)
    5    0.000    0.000    0.000    0.000 sre_parse.py:222(isname)
    6    0.000    0.000    0.000    0.000 sre_parse.py:263(_escape)
  9/1    0.000    0.000    0.001    0.001 sre_parse.py:307(_parse_sub)
 10/1    0.000    0.000    0.001    0.001 sre_parse.py:385(_parse)
    1    0.000    0.000    0.001    0.001 sre_parse.py:669(parse)
   10    0.000    0.000    0.000    0.000 sre_parse.py:73(__init__)
    8    0.000    0.000    0.000    0.000 sre_parse.py:78(opengroup)
    8    0.000    0.000    0.000    0.000 sre_parse.py:89(closegroup)
   62    0.000    0.000    0.000    0.000 sre_parse.py:96(__init__)
    1    0.000    0.000    0.000    0.000 support.py:102(PsycoFrame)
    1    0.000    0.000    0.000    0.000 support.py:15(warning)
    7    0.000    0.000    0.000    0.000 support.py:170(patch)
    1    0.000    0.000    0.000    0.000 support.py:70(Frame)
    1    0.000    0.000    0.000    0.000 support.py:74(PythonFrame)
    1    0.002    0.002    0.003    0.003 support.py:9(<module>)
    1    0.007    0.007    0.007    0.007 threading.py:1(<module>)
    1    0.000    0.000    0.000    0.000 threading.py:152(Condition)
    1    0.000    0.000    0.000    0.000 threading.py:155(_Condition)
    1    0.000    0.000    0.000    0.000 threading.py:157(__init__)
    1    0.000    0.000    0.000    0.000 threading.py:271(_Semaphore)
    1    0.000    0.000    0.000    0.000 threading.py:318(_BoundedSemaphore)
    1    0.000    0.000    0.000    0.000 threading.py:333(_Event)
    1    0.000    0.000    0.000    0.000 threading.py:37(_Verbose)
    1    0.000    0.000    0.000    0.000 threading.py:383(Thread)
    2    0.000    0.000    0.000    0.000 threading.py:39(__init__)
    1    0.000    0.000    0.000    0.000 threading.py:392(__init__)
    1    0.000    0.000    0.000    0.000 threading.py:602(_Timer)
    1    0.000    0.000    0.000    0.000 threading.py:631(_MainThread)
    1    0.000    0.000    0.000    0.000 threading.py:633(__init__)
    1    0.000    0.000    0.000    0.000 threading.py:640(_set_daemon)
    1    0.000    0.000    0.000    0.000 threading.py:671(_DummyThread)
    1    0.000    0.000    0.000    0.000 threading.py:79(_RLock)
    1    0.000    0.000    0.000    0.000 traceback.py:1(<module>)
    1    0.000    0.000    0.000    0.000 tz.py:27(FixedOffsetTimezone)
    1    0.000    0.000    0.000    0.000 tz.py:6(<module>)
    1    0.000    0.000    0.000    0.000 tz.py:69(LocalTimezone)
    1    0.000    0.000    0.000    0.000 {_sre.compile}
    3    0.000    0.000    0.000    0.000 {built-in method __new__ of type object at 0x1f19a0}
    2    0.000    0.000    0.000    0.000 {built-in method acquire}
    2    0.000    0.000    0.000    0.000 {built-in method release}
    1    0.002    0.002    1.733    1.733 {execfile}
   29    0.000    0.000    0.000    0.000 {getattr}
    7    0.000    0.000    0.000    0.000 {globals}
    5    0.000    0.000    0.000    0.000 {hasattr}
   28    0.000    0.000    0.000    0.000 {isinstance}
   1176/1167    0.000    0.000    0.001    0.000 {len}
    5    0.000    0.000    0.000    0.000 {locals}
    1    0.000    0.000    0.000    0.000 {map}
    2    0.000    0.000    0.000    0.000 {max}
  192    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
  449    0.357    0.001    0.357    0.001 {method 'commit' of 'psycopg2._psycopg.connection' objects}
    1    0.000    0.000    0.000    0.000 {method 'cursor' of 'psycopg2._psycopg.connection' objects}
    1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  894    0.410    0.000    0.410    0.000 {method 'execute' of 'psycopg2._psycopg.cursor' objects}
   34    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
    1    0.000    0.000    0.000    0.000 {method 'has_key' of 'dict' objects}
    1    0.000    0.000    0.000    0.000 {method 'insert' of 'list' objects}
    5    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}
  170    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
    1    0.000    0.000    0.000    0.000 {method 'keys' of 'dict' objects}
    1    0.000    0.000    0.000    0.000 {method 'keys' of 'dictproxy' objects}
    7    0.000    0.000    0.000    0.000 {method 'lower' of 'str' objects}
    2    0.000    0.000    0.000    0.000 {method 'lstrip' of 'str' objects}
    5    0.101    0.020    0.101    0.020 {method 'readlines' of 'file' objects}
    8    0.000    0.000    0.000    0.000 {method 'remove' of 'list' objects}
  450    0.003    0.000    0.003    0.000 {method 'split' of 'str' objects}
  112    0.000    0.000    0.000    0.000 {method 'startswith' of 'str' objects}
    1    0.000    0.000    0.000    0.000 {method 'toordinal' of 'datetime.date' objects}
    7    0.000    0.000    0.000    0.000 {method 'upper' of 'str' objects}
   40    0.000    0.000    0.000    0.000 {min}
    5    0.000    0.000    0.000    0.000 {open}
    8    0.000    0.000    0.000    0.000 {ord}
    1    0.000    0.000    0.000    0.000 {posix.listdir}
    1    0.000    0.000    0.000    0.000 {psyco._psyco.memory}
    1    0.000    0.000    0.000    0.000 {psyco._psyco.profiling}
    1    0.000    0.000    0.000    0.000 {psyco._psyco.statread}
    1    0.005    0.005    0.005    0.005 {psycopg2._psycopg.connect}
    1    0.000    0.000    0.000    0.000 {range}
   58    0.000    0.000    0.000    0.000 {setattr}
    4    0.000    0.000    0.000    0.000 {thread.allocate_lock}
    1    0.000    0.000    0.000    0.000 {thread.get_ident}
  455    0.061    0.000    0.083    0.000 {time.strptime}
   10    0.000    0.000    0.000    0.000 {time.time}
Ryan Bigg
+2  A: 

Use bind variables instead of literal values in the sql statements and create a cursor for each unique sql statement so that the statement does not need to be reparsed the next time it is used. From the python db api doc:

Prepare and execute a database operation (query or command). Parameters may be provided as sequence or mapping and will be bound to variables in the operation. Variables are specified in a database-specific notation (see the module's paramstyle attribute for details). [5]

A reference to the operation will be retained by the cursor. If the same operation object is passed in again, then the cursor can optimize its behavior. This is most effective for algorithms where the same operation is used, but different parameters are bound to it (many times).

ALWAYS ALWAYS ALWAYS use bind variables.

Mark Roddy
A: 

@Mark Roddy:

Could you show me an example please?

Ryan Bigg
+2  A: 

In the for loop, you're inserting into the 'chats' table repeatedly, so you only need a single sql statement with bind variables, to be executed with different values. So you could put this before the for loop:

insert_statement="""
    INSERT INTO chats(person_id, message_type, created_at, channel)
    VALUES(:person_id,:message_type,:created_at,:channel)
"""

Then in place of each sql statement you execute put this in place:

cursor.execute(insert_statement, person_id='person',message_type='msg',created_at=some_date, channel=3)

This will make things run faster because:

  1. The cursor object won't have to reparse the statement each time
  2. The db server won't have to generate a new execution plan as it can use the one it create previously.
  3. You won't have to call santitize() as special characters in the bind variables won't part of the sql statement that gets executed.

Note: The bind variable syntax I used is Oracle specific. You'll have to check the psycopg2 library's documentation for the exact syntax.

Other optimizations:

  1. You're incrementing with the "UPDATE people SET chatscount" after each loop iteration. Keep a dictionary mapping user to chat_count and then execute the statement of the total number you've seen. This will be faster then hitting the db after every record.
  2. Use bind variables on ALL your queries. Not just the insert statement, I choose that as an example.
  3. Change all the find_*() functions that do db look ups to cache their results so they don't have to hit the db every time.
  4. psycho optimizes python programs that perform a large number of numberic operation. The script is IO expensive and not CPU expensive so I wouldn't expect to give you much if any optimization.
Mark Roddy
The above code does not work, it gives me a TypeError: 'd' is an invalid keyword argument for this function. Apparently the syntax isn't valid for psycopg2, and I specifically stated I was running on a PostgreSQL database.
Ryan Bigg
The cursor.execute( ) method takes slightly different arguments between different db api implementations. It likes the psycopg2 implementation takes bind variables in a dictionary. For example:
Mark Roddy
bind_vars={'person_id:'person','message_type':'msg','created_at':some_date,'channel':3)cursor.execute(sql_statement,bindvars)
Mark Roddy
+1  A: 

As Mark suggested, use binding variables. The database only has to prepare each statement once, then "fill in the blanks" for each execution. As a nice side effect, it will automatically take care of string-quoting issues (which your program isn't handling).

Turn transactions on (if they aren't already) and do a single commit at the end of the program. The database won't have to write anything to disk until all the data needs to be committed. And if your program encounters an error, none of the rows will be committed, allowing you to simply re-run the program once the problem has been corrected.

Your log_hostname, log_person, and log_date functions are doing needless SELECTs on the tables. Make the appropriate table attributes PRIMARY KEY or UNIQUE. Then, instead of checking for the presence of the key before you INSERT, just do the INSERT. If the person/date/hostname already exists, the INSERT will fail from the constraint violation. (This won't work if you use a transaction with a single commit, as suggested above.)

Alternatively, if you know you're the only one INSERTing into the tables while your program is running, then create parallel data structures in memory and maintain them in memory while you do your INSERTs. For example, read in all the hostnames from the table into an associative array at the start of the program. When want to know whether to do an INSERT, just do an array lookup. If no entry found, do the INSERT and update the array appropriately. (This suggestion is compatible with transactions and a single commit, but requires more programming. It'll be wickedly faster, though.)

Barry Brown
A: 

Additionally to the many fine suggestions @Mark Roddy has given, do the following:

  • don't use readlines, you can iterate over file objects
  • try to use executemany rather than execute: try to do batch inserts rather single inserts, this tends to be faster because there's less overhead. It also reduces the number of commits
  • str.rstrip will work just fine instead of stripping of the newline with a regex

Batching the inserts will use more memory temporarily, but that should be fine when you don't read the whole file into memory.

Torsten Marek
+5  A: 

Don't waste time profiling. The time is always in the database operations. Do as few as possible. Just the minimum number of inserts.

Three Things.

One. Don't SELECT over and over again to conform the Date, Hostname and Person dimensions. Fetch all the data ONCE into a Python dictionary and use it in memory. Don't do repeated singleton selects. Use Python.

Two. Don't Update.

Specifically, Do not do this. It's bad code for two reasons.

cursor.execute("UPDATE people SET chats_count = chats_count + 1 WHERE id = '%s'" % person_id)

It be replaced with a simple SELECT COUNT(*) FROM ... . Never update to increment a count. Just count the rows that are there with a SELECT statement. [If you can't do this with a simple SELECT COUNT or SELECT COUNT(DISTINCT), you're missing some data -- your data model should always provide correct complete counts. Never update.]

And. Never build SQL using string substitution. Completely dumb.

If, for some reason the SELECT COUNT(*) isn't fast enough (benchmark first, before doing anything lame) you can cache the result of the count in another table. AFTER all of the loads. Do a SELECT COUNT(*) FROM whatever GROUP BY whatever and insert this into a table of counts. Don't Update. Ever.

Three. Use Bind Variables. Always.

cursor.execute( "INSERT INTO ... VALUES( :x, :y, :z )", {'x':person_id, 'y':time_to_string(time), 'z':channel,} )

The SQL never changes. The values bound in change, but the SQL never changes. This is MUCH faster. Never build SQL statements dynamically. Never.

S.Lott
The update is done because later on (in a Ruby on Rails application) I want to instantly find how many lines a user has. This number theoretically should never be wrong, and will be faster. The bind variable syntax is wrong (so it seems for my psycopg2), got something that would work?
Ryan Bigg
Did you read my comment to my other answer about the psycopg2 specific bind variable syntax?
Mark Roddy
A select count(*) on a table that has an index (any index should suffice) is a very fast operation since the number of elements is stored directly in the index. Selecting a previously updated count shouldn't be any faster.
gooli
"want to instantly find how many lines a user has" Right. That's what a select count(*) will do. It will be fast. You can AFTER the load, cache these in a count table. After ALL the inserts, not during.
S.Lott
mind doing us a favor and removing the first code snippet there? I just had someone tell me they learned to use % on their sql from this comment. :(
Allen