views:

443

answers:

4

I've been having to do some basic feed processing. So, get a file via ftp, process it (i.e. get the fields I care about), and then update the local database. And similarly the other direction: get data from db, create file, and upload by ftp. The scripts will be called by cron.

I think the idea would be for each type of feed, define the ftp connection/file information. Then there should be a translation of how data fields in the file relate to data fields that the application can work with (and of course process this translation). Additionally write separate scripts that do the common inserting functions for the different objects that may be used in different feeds.

As an e-commerce example, lets say I work with different suppliers who provide feeds to me. The feeds can be different (object) types: product, category, or order information. For each type of feed I obviously work with different fields and call different update or insert scripts.

What is the best language to implement this in? I can work with PHP but am looking for a project to start learning Perl or Python so this could be good for me as well.

If Perl or Python, can you briefly give high level implementation. So how to separate the different scripts, object oriented approach?, how to make it easy to implement new feeds or processing functions in the future, etc.

[full disclosure: There were already classes written in PHP which I used to create a new feed recently. I already did my job, but it was super messy and difficult to do. So this question is not 'Please help me do my job' but rather a 'best approach' type of question for my own development.]

Thanks!

+1  A: 

Most modern languages scripting languages allow you to do all of these things. Because of that, I think your choice of language should be based on what you and the people who read your code know.

In Perl I'd make use of the following modules:

Net::FTP to access the ftp sites. DBI to insert data into your database.

Modules like that are nice reusable pieces of code that you don't have to write, and interaction with ftp sites and databases are so common that every modern scripting language should have similar modules.

I don't think that PHP is a great language so I'd avoid it if possible, but it might make sense for you if you have a lot of experience in it.

James Thompson
Thanks. I do know php, but this question is an excuse to start the process of learning another language as well as using the correct approach/language to this type of common problem.
safoo
+2  A: 

"Best" language is pretty subjective. Python is generally considered to be easy to learn and easy to read, whereas Perl is often jokingly referred to as a "write-only" language. On the other hand, Perl is use extensively for network management. Python tends to be used more for system management or programming in the large. Both have areas of excellence, and areas where they don't work as well.

Either language will allow you to solve your problem fairly easily. They both have all the necessary modules as either bundled libraries, or easily available.

If I were using Python I would use the ConfigParser

http://docs.python.org/library/configparser.html#module-ConfigParser

to store the settings for each project, ftplib:

http://docs.python.org/library/ftplib.html

to talk to the ftp server, and one of the many database libraries. For example, assuming that you are using postgres:

http://www.pygresql.org/

Finally for command line options I would use the excellent option parser module that comes with Python:

http://docs.python.org/library/optparse.html#module-optparse

From a code standpoint I would have the following objects:

# Reads in a config file, decides which feed to use, and passes
# the commands in to one of the classes below for import and export
class FeedManager

# Get data from db into a canonical format
class DbImport

# Put data into db from a canonical format
class DbExport

# Get data from ftp into a canonical format
class FtpImport

# Put data into ftp from canonical format
class FtpExport

each class translates to/from a canonical format that can be handed to one of the other complementary classes.

The config file might look like this:

[GetVitalStats]
SourceUrl=ftp.myhost.com
SourceType=FTP

Destination=Host=mydbserver; Database=somedb
SourceType=Postgres

And finally, you would call it like this:

process_feed.py --feed=GetVitalStats
Christopher
+1: Good. Except for one thing: configparser is useless. Just write Python code. Since Python is interpreted, you don't really need separate config files. It's simpler and you get up and running sooner.
S.Lott
"Perl is use extensively for network management. Python tends to be used more for system management or programming..."So i think in this problem, ftp'ing the files is relatively simpler and its more about what is best to parse the text file and db data.
safoo
The problem with using Python files as config files often boils down to who will be editing the config files. Non-programmers will not understand why they have to wrap things in quotes, and why they have to write \\ instead of \ in a string. Also, the configparser section is much clearer to read than binding things to a dictionary, IMHO. :-)
Christopher
+3  A: 

Kind of depends on the format of the files you're ftp'ing. If it's a crazy proprietary format, you might be stuck with whatever language already has a library managing it. If it's CSV or XML, then any language might do.

Just as examples. It seems pretty straight-forward, but I do perl nearly every day ;-)

Tanktalus
Assume XML, CSV, or tab/character delimited file.
safoo
Config files, since it's mentioned in the Python post: YAML (bit Perl-specific, although it overlaps nicely with JSON) or COnfig::General. (Or Config::Any, of course.)
ijw
+1  A: 

Python.

1st. What format are these FTP'd files? I'll assume they're CSV.

2nd. How do you know when to run the FTP get? Fixed schedule? Event? I'll assume it's a fixed schedule. You'll use cron to control this.

You have three issues: FTP get, data extract, DB load.

ftp_get_load.py

import ftplib
import csv
import someDatabaseAPI as sql

class GetFile( object ):
    ... general case solution using ftplib ...

class ExtractData( object ):
    ... general case solution using csv ...

class LoadDB( object ):
    ... general case solution using sql ...

some_load.py

import ftp_get_load

class UniqueExtractor( ftp_get_load.ExtractData ):
    ... overrides ...

get = GetFile( url, filename, etc. )
extract = UniqueExtractor( filenamein, filenameout, etc. )
load = LoadDB( filename, etc. )

if __name__ == "__main__":
    get.execute()
    extract.execute()
    load.execute()
S.Lott