Requirements: I have a Python project which parses data feeds from multiple sources in varying formats (Atom, valid XML, invalid XML, CSV, almost-garbage, etc.) and inserts the resulting data into a database. The catch is that the information required to parse each feed must also be stored in the database.

Current solution: My previous solution was to store small Python scripts which are eval'ed against the raw data and return a data object for the parsed data. I'd really like to get away from this method, as it obviously opens up a nasty security hole.

Ideal solution: What I'm looking for is what I'd describe as a template-driven feed parser for Python: I would write a template file for each of the feed formats, and this template file would be used to make sense of the various data formats.

I've had limited success finding something like this in the past, and was hoping someone may have a good suggestion.

Thanks everyone!

+1  A: 

Instead of eval'ing scripts, maybe you should consider making a package of them? Parsing CSV is one thing (the format is simple and regular); parsing XML requires a completely different approach. Considering you don't want to write every single parser from scratch, why not write a bunch of small modules that all expose an identical API, and use them? I believe using Python itself (not some templating DSL) is ideal for this sort of thing.

For example, this is an approach I've seen in one small torrent-fetching script I'm using:

Main program:

...
def import_plugin(name):
    # __import__('a.b.c') returns the top-level package 'a',
    # so walk the dotted path to reach the leaf module.
    mod = __import__(name)
    components = name.split('.')
    for comp in components[1:]:
        mod = getattr(mod, comp)
    return mod

...
feed_parser = import_plugin('parsers.%s' % feed['format'])
data = feed_parser(...)
...
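Incidentally, the dotted-attribute walk in `import_plugin` is exactly what the standard library's `importlib.import_module` does, so the helper can shrink to a one-liner. A minimal sketch (the `parsers.atom` name in the usage comment is an illustrative assumption):

```python
import importlib

def import_plugin(name):
    # importlib resolves dotted module paths and returns the
    # leaf module, so no manual getattr() walk is needed.
    return importlib.import_module(name)

# Usage, assuming a parsers/ package containing an atom.py module:
# feed_parser = import_plugin('parsers.atom')
# data = feed_parser.parse_feed(raw_data)
```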

parsers/csv.py:

#!/usr/bin/python
from __future__ import absolute_import

import urllib2
import csv

def parse_feed(...):
    ...
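To make the stub concrete, here is one possible `parse_feed` body for the CSV case. It is a sketch, not the answer's actual code: it operates on already-fetched text rather than a URL, and the column names in the example (`title`, `link`) are illustrative assumptions about what a feed might contain.

```python
import csv
import io

def parse_feed(raw_text):
    """Parse CSV feed text into a list of dicts, one per row.

    Assumes the first row is a header; the keys of each dict are
    whatever column names the feed provides.
    """
    reader = csv.DictReader(io.StringIO(raw_text))
    return [dict(row) for row in reader]
```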

If you don't particularly like dynamically loaded modules, you may consider writing, for example, a single module containing several parser classes (probably derived from some "abstract parser" base class).

class BaseParser(object):
    ...

class CSVParser(BaseParser):
    ...
register_feed_parser(CSVParser, ['text/plain', 'text/csv'])
...

parsers = get_registered_feed_parsers(feed['mime_type'])
data = None
for parser in parsers:
    try:
        data = parser(feed['data'])
        if data is not None: break
    except ParsingError:
        pass
...
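The snippet above leaves `register_feed_parser` and `get_registered_feed_parsers` undefined; a minimal registry can be a dict keyed by MIME type. The function names follow the snippet, but everything else here is a sketch:

```python
# Maps a MIME type string to the list of parsers registered for it.
_registry = {}

def register_feed_parser(parser, mime_types):
    # Register one parser under each of its supported MIME types,
    # preserving registration order so callers can try them in turn.
    for mime in mime_types:
        _registry.setdefault(mime, []).append(parser)

def get_registered_feed_parsers(mime_type):
    # Return all parsers for this MIME type (possibly an empty list).
    return list(_registry.get(mime_type, []))
```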
drdaeman
Thanks drdaeman, I really like that solution and may end up using it. The only place where it falls short is that the parsing scripts need to be stored in a database. The reason for the database requirement is that an administrator of this site would ideally be able to create and manage these parsing scripts (there are dozens of them) in a web interface, but even though administrators are trusted users, it's still undesirable to have them enter code that ends up getting eval'ed. I think it will come down to creating a new module or going with your suggestion. Thanks again!
Jon Biddle
Thanks. If the code needs to be accessible by end users, then maybe I was wrong, and creating a DSL or a sandbox that allows access only to trusted Python modules and operations is the way to go. Unfortunately I haven't developed anything like that, so I don't have many ideas. Maybe this link will be useful, though: http://pypi.python.org/pypi/RestrictedPython/
drdaeman
Thanks again for your advice. I may end up going with eval within the RestrictedPython sandbox. Alternatively, if I feel ambitious, I might try creating a Python module to do this.
Jon Biddle