views:

337

answers:

5

I'm writing a parser, and there is LOTS of text to decode but most of my users will only care about a few fields from all the data. So I only want to do the decoding when a user actually uses some of the data. Is this a good way to do it?

class LazyString(str):
    def __init__(self, v) :
        self.value = v
    def __str__(self) :
        r = ""
        s = self.value
        for i in xrange(0, len(s), 2) :
            r += chr(int(s[i:i+2], 16))
        return r

def p_buffer(p):
    """buffer : HASH chars"""
    p[0] = LazyString(p[2])

Is that the only method I need to override?

+1  A: 

I don't see any kind on lazy evaluation in your code. The fact that you use xrange only means that the list of integers from 0 to len(s) will be generated on demand. The whole string r will be decoded during string conversion anyway.

The best way to implement lazy sequence in Python is using generators. You could try something like this:

def lazy(v):
    for i in xrange(0, len(v), 2):
        yield int(v[i:i+2], 16)

list(lazy("0a0a0f"))
Out: [10, 10, 15]
Adam Byrtek
Well, I wanted to make the LazyString object, but not do the big for loop until people actually used it.
Paul Tarjan
It depends on your needs, but from your question I understood that you would like to decode just a part of the string that is needed - this can be done using generators.
Adam Byrtek
I actually want to only decode the string if the client code uses it. Maybe I wasn't very clear in my questions, sorry.
Paul Tarjan
A: 

What you're doing is built in already:

s =  "i am a string!".encode('hex')
# what you do
r = ""
for i in xrange(0, len(s), 2) :
    r += chr(int(s[i:i+2], 16))
# but decoding is builtin
print r==s.decode('hex') # => True

As you can see your whole decoding is s.decode('hex').

But "lazy" decoding sounds like premature optimization to me. You'd need gigabytes of data to even notice it. Try profiling, the .decode is 50 times faster that your old code already.

Maybe you want somthing like this:

class DB(object): # dunno what data it is ;)
    def __init__(self, data):
        self.data = data
        self.decoded = {} # maybe cache if the field data is long
    def __getitem__(self, name):
        try:
            return self.decoded[name]
        except KeyError:
            # this copies the fields data
            self.decoded[name] = ret = self.data[ self._get_field_slice( name ) ].decode('hex')
            return ret
    def _get_field_slice(self, name):
        # find out what part to decode, return the index in the data
        return slice( ... )

db = DB(encoded_data)    
print db["some_field"] # find out where the field is, get its data and decode it
THC4k
Nice! I didn't know about .decode(hex). I'll use that. But indeed, I AM doing it on petabytes of data on Hadoop, so I don't want to decode all the data, just the part that the specific programs asks for. I profiled my code and 99% of the time was spent in the decoding loop, and 1% in the parser, so that's why I'm working on this part :)
Paul Tarjan
Thank you for the edit. Its close, but I want to emulate a string, and not a whole database. So I want a class that looks a feels like a string when accessed, but doesn't actually decode its data until it is used. Is that possible? And does my sample code work for that?
Paul Tarjan
A: 

The methods you need to override really depend on how are planning to use you new string type.

However you str based type looks a little suspicious to me, have you looked into the implementation of str to check that it has the value attribute that you are setting in your __init__()? Performing a dir(str) does not indicate that there is any such attribute on str. This being the case the normal str methods will not be operating on your data at all, I doubt that is the effect you want otherwise what would be the advantage of sub-classing.

Sub-classing base data types is a little strange anyway unless you have very specific requirements. For the lazy evaluation you want you are probably better of creating your class that contains a string rather than sub-classing str and write your client code to work with that class. You will then be free to add the just in time evaluation you want in a number of ways an example using the descriptor protocol can be found in this presentation: Python's Object Model (search for "class Jit(object)" to get to the relevant section)

Tendayi Mawushe
`value` is my own incantation. I'd rather this feel like a real string to my end users, instead of having to use another class. Should I just implement every method in `dir(str)`? Do they all rely on `__str__` themselves?
Paul Tarjan
I am not really sure how `str` works internally as it is a built in type implemented in C and I have not looked at the C source for it. However it is unlikely that it uses `__str__` internally. This is a special method that Python calls in various situations to get a string representation of any object. I suspect in the `str` type this method just returns `self`. I can't see why `str` would use this to get to its own data but I could be wrong.
Tendayi Mawushe
A: 

I'm not sure how implementing a string subclass is of much benefit here. It seems to me that if you're processing a stream containing petabytes of data, whenever you've created an object that you don't need to you've already lost the game. Your first priority should be to ignore as much input as you possibly can.

You could certainly build a string-like class that did this:

class mystr(str):
    def __init__(self, value):
        self.value = value
        self._decoded = None
    @property
    def decoded(self):
        if self._decoded == None:
            self._decoded = self.value.decode("hex")
            return self._decoded
    def __repr__(self):
        return self.decoded
    def __len__(self):
        return len(self.decoded)
    def __getitem__(self, i):
        return self.decoded.__getitem__(i)
    def __getslice__(self, i, j):
        return self.decoded.__getslice__(i, j)

and so on. A weird thing about doing this is that if you subclass str, every method that you don't explicitly implement will be called on the value that's passed to the constructor:

>>> s = mystr('a0a1a2')
>>> s
 ¡¢
>>> len(s)
3
>>> s.capitalize()
'A0a1a2'
Robert Rossney
I'm writing a record reader, where the record goes out of scope once the record is done being processed, so I don't think creating the object would be a problem, just the decoding is slow. Do you think otherwise?
Paul Tarjan
When you're dealing with trillions of bytes, everything's a problem. Just reading from BOF to EOF is a problem. I'd benchmark the heck out of this. Like, it may be possible for you to avoid object creation by using `array.fromfile()` to read a large block of data into an array and then process the array using pointers, and only copy the data out of the array when you've identified something you're interested in. Hard to know without knowing more about your application.
Robert Rossney
I'm actually reading all this on Hadoop with HDFS using their CSV Record format. Since I'm running on thousands of computers, the single file doesn't have too much data. I actually wrote a lex and yacc parser for this format : http://github.com/ptarjan/hadoop_record . I think `ply` uses a stringbuffer in memory when parsing, but I'm not sure.
Paul Tarjan
A: 

The question is incomplete, in that the answer will depend on details of the encoding you use.

Say, if you encode a list of strings as pascal strings (i.e. prefixed with string length encoded as a fixed-size integer), and say you want to read the 100th string from the list, you may seek() forward for each of the first 99 strings and not read their contents at all. This will give some performance gain if the strings are large.

If, OTOH, you encode a list of strings as concatenated 0-terminated stirngs, you would have to read all bytes until the 100th 0.

Also, you're speaking about some "fields" but your example looks completely different.

anonymous
My example is a piece of ply parser code. My input is a string, and I am parsing it into records. My records will have some maps, vectors, strings, and numbers in them. In my input string, the strings are hex encoded, and I don't want to decode them unless the user of my library wants them explicitly. Does that help?
Paul Tarjan