views:

899

answers:

6

I have a very basic CSV file upload module working to bulk upload my user's data into my site. I process the CSV file in the backend with a python script that runs on crontab and then email the user the results of the bulk upload. This process works ok operationally, but my issue is with the format of the csv file.

Are there good tools or even basic rules on how to accept different formats of the csv file? The user may have a different order of data columns, slightly different names for the column headers (I want the email column to be entitled "Email", but it may say "Primary Email", "Email Address"), or missing additional data columns. Any good examples of CSV upload functionality that is very permissive and user friendly?

Also, how do I tell the user to export as CSV data? I'm importing address book information, so this data often comes from Outlook, Thunderbird, other software packages that have address books. Are there other popular data formats that I should accept?

+1  A: 

I would handle the random column header mapping in your script once it's uploaded. It's hard to make a "catch all" that would handle whatever the users might enter. I would have it evolve as you go and slowly build a list of one-one relations based on what your user uploads.

Or!

Check the column headers and make sure it's properly formatted and advise them how to fix it if it is not.

"Primary Email" not recognized, our schema is "Email", "Address", "Phone", etc.

You could also accept XML and this would allow you to create your own schema that they would have to adhere to. Check out this tutorial.

hunter
+3  A: 

I'd check out Python's built-in csv module. Frankly a .replace() on your first row should cover your synonyms issue, and if you're using csv.DictReader you should be able to deal with missing columns very easily:

my_dict_reader = csv.DictReader(somecsvfile)
for row in my_dict_reader:
    SomeDBModel.address2=row.get('address2', None)

assuming you wanted to store a None value for missing fields.

Yoni Samlan
This answer covers a lot of important concepts. Only, csv.DictReader(somecsvfile) could have been csv.DictReader(open("somecsvfile.csv","rb"))
Pranab
+2  A: 

You should force the first row to be the headers, make the user match up their headers to your field names on the next page, and remember that mapping for their future dumps.

Whenever I do CSV imports the data really came from an Excel spreadsheet. I've been able to save time by using pyexcelerator to import the .xls directly. My .csv or .xls code is a generator that yields {'field_name':'data', ...} dictionaries that can be assigned to model objects.

If you're doing address data, you should accept vCard.

joeforker
+1  A: 

Take a look at this project: django-batchimport

It might be overkill for you, but it can still give you some good ideas on improving your own code.

Edit: also, ignore that it is only using xlrd for importing Excel. The base concepts are the same, just that you will use the csv module instead of xlrd.

Van Gale
+1  A: 

If you'll copy excel table into clipboard and then paste results into notepad, you'll notice that it's tab separated. I once used it to make bulk import from most of table editors by copy-pasting data from the editor into textarea on html page.

You can use a background for textarea as a hint for number of columns and place your headers at the top suggesting the order for a user.

Javascript will process pasted data and display them to the user immediately with simple prevalidation making it easy to fix an error and repaste.

Then import button is clicked, data is validated again and import results are displayed. Unfortunately, I've never heard any feedback about whenever this was easy to use or not.

Anyway, I still see it as an option when implementing bulk import.

clorz
+1  A: 

Look at csv module from stdlib. It contains presets for popualr CSV dialects like one produced by Excel.

Reader class support field mapping and if file contains column header it coes not depend on column order. For more complex logic, like looking up several alternative names for a field, you'll need to write your own implementation.

Alex Lebedev