I am using Avro 1.4.0 to read some data out of S3 via the Python avro bindings and the boto S3 library. When I open an avro.datafile.DataFileReader on the file-like objects returned by boto, it fails immediately when it tries to seek(). For now I am working around this by reading the S3 objects into temporary files.

I would like to be able to stream through any Python object that supports read(). Can anybody provide advice?

A: 

I am not entirely sure about this, so this may not be the answer. I was under the impression that

diter = datafile.DataFileReader(..) 

returns an iterator so that you could do the following

for data in diter:
    ....

Correct me if I am wrong here.

Revisiting my answer:

You are right, datafile.DataFileReader does not play well with a reader on which seek() fails.

It uses avro.io.BinaryDecoder, which accepts a reader:

class BinaryDecoder(object):
    """Read leaf values."""
    def __init__(self, reader):
        """
        reader is a Python object on which we can call read, seek, and tell.
        """
        self._reader = reader

What you can do is create your own reader class that provides these functions (read, seek and tell) but internally uses the boto S3 library to read the data.
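
A minimal sketch of such a reader class follows. The class name BufferedSeekableReader and its internals are hypothetical, not part of Avro or boto. It assumes only that the wrapped object has read(n), and it keeps every byte it has read in memory so that backward seeks work. I have not tested it against Avro 1.4.0, so treat it as a starting point.

```python
import io


class BufferedSeekableReader(object):
    """Hypothetical adapter that adds seek() and tell() to a read()-only stream.

    Every byte pulled from the underlying stream is kept in an in-memory
    buffer, so backward seeks work at the cost of memory.
    """

    def __init__(self, stream):
        self._stream = stream        # any object with read(n), e.g. a boto key
        self._buffer = io.BytesIO()  # bytes consumed from the stream so far
        self._size = 0               # number of bytes held in the buffer
        self._pos = 0                # logical position seen by the caller

    def _fill_to(self, target):
        """Read from the stream until `target` bytes are buffered, or EOF.

        A target of None means "read everything that is left".
        """
        self._buffer.seek(0, io.SEEK_END)
        while target is None or self._size < target:
            want = 65536 if target is None else target - self._size
            chunk = self._stream.read(want)
            if not chunk:
                break  # underlying stream is exhausted
            self._buffer.write(chunk)
            self._size += len(chunk)

    def read(self, n=-1):
        read_all = n is None or n < 0
        self._fill_to(None if read_all else self._pos + n)
        self._buffer.seek(self._pos)
        data = self._buffer.read() if read_all else self._buffer.read(n)
        self._pos += len(data)
        return data

    def seek(self, offset, whence=0):
        if whence == 0:          # absolute position
            target = offset
        elif whence == 1:        # relative to the current position
            target = self._pos + offset
        elif whence == 2:        # relative to the end: must drain the stream
            self._fill_to(None)
            target = self._size + offset
        else:
            raise ValueError("invalid whence: %r" % (whence,))
        if target < 0:
            raise IOError("negative seek position")
        self._fill_to(target)
        # Unlike a real file, seeking past EOF clamps to the end of the data.
        self._pos = min(target, self._size)
        return self._pos

    def tell(self):
        return self._pos

    def close(self):
        # Close the wrapped stream if it supports it.
        if hasattr(self._stream, "close"):
            self._stream.close()
```

You would then pass BufferedSeekableReader(key) instead of the raw boto key as the first argument to datafile.DataFileReader, together with your avro.io.DatumReader(). Note that memory use grows with the amount of data read, so for very large objects the temporary file may still be the better option.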

pyfunc
You are correct - for files. So "for data in datafile.DataFileReader(open("/tmp/f")): ..." will work. But I'm reading from a boto S3 stream that does not support seek(), and DataFileReader() tries to seek() first thing in order to read the header.
Spike Gronim
@Spike Gronim: You are correct. But looking at the source file, it expects a reader that implements these functions: read, seek and tell. So I guess creating a reader that reads from the stream but exposes these functions on top of it should be worth a try.
pyfunc