The Problem:

I am getting the following traceback and don't understand what it means or how to fix it:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Python26\lib\multiprocessing\forking.py", line 342, in main
    self = load(from_parent)
  File "C:\Python26\lib\pickle.py", line 1370, in load
    return Unpickler(file).load()
  File "C:\Python26\lib\pickle.py", line 858, in load
    dispatch[key](self)
  File "C:\Python26\lib\pickle.py", line 1083, in load_newobj
    obj = cls.__new__(cls, *args)
TypeError: object.__new__(pyodbc.Cursor) is not safe, use pyodbc.Cursor.__new__()

The situation:

I've got a SQL Server database full of data to be processed. I'm trying to use the multiprocessing module to parallelize the work and take advantage of the multiple cores on my computer. My general class structure is as follows:

  • MyManagerClass
    • This is the main class, where the program starts.
    • It creates two multiprocessing.Queue objects, one work_queue and one write_queue
    • It also creates and launches the other processes, then waits for them to finish.
    • NOTE: this is not an extension of multiprocessing.managers.BaseManager()
  • MyReaderClass
    • This class reads the data from the SQL Server database.
    • It puts items in the work_queue.
  • MyWorkerClass
    • This is where the work processing happens.
    • It gets items from the work_queue and puts completed items in the write_queue.
  • MyWriterClass
    • This class is in charge of writing the processed data back to the SQL Server database.
    • It gets items from the write_queue.

The idea is that there will be one manager, one reader, one writer, and many workers.
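
In skeleton form the layout looks roughly like this (heavily simplified; the real classes hold the database and processing logic, and the worker count is illustrative):

import multiprocessing

class MyReaderClass(multiprocessing.Process):
    def __init__(self, work_queue):
        multiprocessing.Process.__init__(self)
        self.work_queue = work_queue

    def run(self):
        # read rows from SQL Server and put items in work_queue
        pass

class MyWorkerClass(multiprocessing.Process):
    def __init__(self, work_queue, write_queue):
        multiprocessing.Process.__init__(self)
        self.work_queue = work_queue
        self.write_queue = write_queue

    def run(self):
        # get items from work_queue, put completed items in write_queue
        pass

class MyWriterClass(multiprocessing.Process):
    def __init__(self, write_queue):
        multiprocessing.Process.__init__(self)
        self.write_queue = write_queue

    def run(self):
        # write processed items back to SQL Server
        pass

class MyManagerClass(object):
    def main(self):
        work_queue = multiprocessing.Queue()
        write_queue = multiprocessing.Queue()
        processes = [MyReaderClass(work_queue), MyWriterClass(write_queue)]
        processes += [MyWorkerClass(work_queue, write_queue)
                      for _ in range(multiprocessing.cpu_count())]
        for p in processes:
            p.start()
        for p in processes:    # wait for everything to finish
            p.join()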

Other details:

I get the traceback twice in stderr, so I'm thinking it happens once for the reader and once for the writer. My worker processes are created fine, but they just sit there until I send a KeyboardInterrupt, because nothing ever arrives in the work_queue.

Both the reader and writer have their own connection to the database, created on initialization.

Solution:

Thanks to Mark and Ferdinand Beyer for their answers and the questions that led to this solution. They rightly pointed out that the Cursor object is not picklable, and pickling is the method that multiprocessing uses to pass information between processes.

The issue with my code was that MyReaderClass(multiprocessing.Process) and MyWriterClass(multiprocessing.Process) both connected to the database in their __init__() methods. I created both of these objects (i.e. called their __init__() methods) in MyManagerClass, then called start().

So the parent (manager) process would create the connection and cursor objects and then try to send them to the child process via pickle. My solution was to move the instantiation of the connection and cursor objects into the run() method, which isn't called until the child process is fully created.
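
In sketch form, the change looks like this (the connection string, table, and column names are placeholders, not my actual code):

import multiprocessing
import pyodbc

class MyReaderClass(multiprocessing.Process):
    def __init__(self, work_queue, connection_string):
        multiprocessing.Process.__init__(self)
        self.work_queue = work_queue
        self.connection_string = connection_string
        # BROKEN: connecting here runs in the parent process, and the
        # connection/cursor objects then fail to pickle when the child
        # process is spawned:
        # self.conn = pyodbc.connect(self.connection_string)

    def run(self):
        # FIXED: run() executes inside the child process, so the
        # connection never has to cross a process boundary.
        conn = pyodbc.connect(self.connection_string)
        cursor = conn.cursor()
        cursor.execute("SELECT id, stuff_to_process FROM work_items")
        for row in cursor:
            # reduce the pyodbc Row to a plain tuple before queueing
            self.work_queue.put(tuple(row))
        conn.close()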

+2  A: 

The error is raised within the pickle module, so somewhere your DB cursor object gets pickled and unpickled (serialized to a byte stream and deserialized back into a Python object).

I suspect that pyodbc.Cursor does not support pickling. Why would you want to persist the cursor object anyway?

Check whether you use pickle somewhere in your work chain, or whether it is used implicitly.
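
One quick way to check is to try pickling each object you hand off between processes (is_picklable here is a made-up helper, not a library function):

import cPickle

def is_picklable(obj):
    # Attempt to pickle obj; unpicklable objects raise PicklingError
    # or, as with pyodbc cursors, a plain TypeError.
    try:
        cPickle.dumps(obj)
    except (cPickle.PicklingError, TypeError):
        return False
    return True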

Ferdinand Beyer
Looks like multiprocessing uses it implicitly to pass things through Pipe objects between processes (specifically the Queue objects I created).
tgray
+3  A: 

Multiprocessing relies on pickling to communicate objects between processes. The pyodbc connection and cursor objects cannot be pickled, as a quick test shows:

>>> import cPickle
>>> cPickle.dumps(aCursor)      # aCursor: an open pyodbc cursor
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.5/copy_reg.py", line 69, in _reduce_ex
    raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle Cursor objects
>>> cPickle.dumps(dbHandle)     # dbHandle: a pyodbc connection
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.5/copy_reg.py", line 69, in _reduce_ex
    raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle Connection objects

"It puts items in the work_queue", what items? Is it possible the cursor object is getting passed as well?

Mark
I have a generator that loops over items in a cursor (basically calling pyodbc's Cursor.fetchone()). I believe it yields a tuple of (id, stuff_to_process), which is what I put in the queue. I tried making a deepcopy, but that didn't work. I looked at the help, and it's actually an instance of a Row object, so I may need to convert it to a tuple first.
tgray
The Row object must contain a reference to the Cursor or something.
tgray
Converting the Row to a tuple didn't solve it.
tgray
Maybe it's because I create the reader and writer in MyManagerClass, and the connection/cursor are created in the __init__(), so technically they're created in the manager process, then piped to their own child process?
tgray
I'll try moving the instantiation of the connection to the run() method.
tgray
It worked! I'll update the question to reflect this.
tgray