Two-headed question here guys,

First, I've been searching for a way to read .xlsx files in Python. Does xlrd read .xlsx files now? If not, what's the recommended way to read from and write to such a file?

Second, I have two files with similar information: one primary field with scoping subfields (like coordinates (the primary field) -> city -> state -> country). In the older file, each record is given an ID number, while the newer file (with records deleted/added) does not have these IDs. In Python, I'd 1) open the two files, 2) check the primary field of the older file against the primary field of the newer file, and merge their information into a new file if they match. Given that it's not too big of a file, I don't mind the O(n^2) complexity. My question is this: is there a well-defined way to do this in VBA or Excel? Everything I think of using Excel's library seems too slow, and I'm not excellent with VBA.

A: 

Try http://www.python-excel.org/

My mistake - I missed the .xlsx detail.

I guess it's a question of what's easier: finding or writing a library that handles the .xlsx format natively, OR saving all the Excel spreadsheets as .xls and getting on with it using the libraries that only handle the older format.

duffymo
Which of those packages read xlsx files? It seems like `xlrd` (the reader) doesn't support xlsx.
Jon-Eric
From all the information I can gather, neither of those modules seems to support .xlsx files. Am I mistaken here, or does a newer version actually support it?
jlv
My mistake - I missed the .xlsx detail
duffymo
+2  A: 

I frequently access Excel files either through Python and xlrd, or through Python and the Excel COM object. For this job, xlrd won't work because it does not support the xlsx format. But no matter, both approaches are overkill for what you are looking for. Simple Excel formulas will deliver what you want, specifically VLOOKUP.

VLOOKUP "looks for a value in the leftmost column of a table, and then returns a value in the same row from the column you specify".

Some advice on VLOOKUP: first, if you want to match on multiple cells, create a "key" cell which concatenates the cells you are interested in (in both workbooks). Second, make sure to set the last argument to VLOOKUP as FALSE because you will only want exact matches.
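
For comparison, the same "key cell plus exact match" idea can be sketched in Python. This is only an illustration under assumptions: the field names (`coordinates`, `city`, `id`) and the helpers `make_key` and `merge_ids` are made up for the example, and the records are plain dicts rather than worksheet rows.

```python
def make_key(record, fields):
    """Concatenate the chosen fields into one lookup key (like a helper 'key' column)."""
    return "|".join(str(record[f]).strip() for f in fields)


def merge_ids(old_rows, new_rows, fields, id_field="id"):
    """Copy the ID from old_rows onto each new row whose key matches exactly."""
    # Build the lookup table once, so each match is a dict lookup rather than a scan.
    ids_by_key = {make_key(r, fields): r[id_field] for r in old_rows}
    merged = []
    for row in new_rows:
        out = dict(row)
        # .get returns None when there is no exact match (VLOOKUP would show #N/A).
        out[id_field] = ids_by_key.get(make_key(row, fields))
        merged.append(out)
    return merged


# Hypothetical sample data: the old file has IDs, the new file does not.
old_rows = [
    {"id": 1, "coordinates": "40.7,-74.0", "city": "New York"},
    {"id": 2, "coordinates": "51.5,-0.1", "city": "London"},
]
new_rows = [
    {"coordinates": "51.5,-0.1", "city": "London"},
    {"coordinates": "48.9,2.4", "city": "Paris"},
]
result = merge_ids(old_rows, new_rows, ["coordinates", "city"])
```

Here the London row picks up ID 2 and the Paris row, which has no counterpart in the old data, gets `None`. Using a dict also avoids the O(n^2) scan mentioned in the question.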

Regarding performance, excel formulas are often very fast.

Read the help file on VLOOKUP and ask further questions here.

Late edit (from Mark Baker's answer): There is now a python solution for xlsx. Openpyxl was created this year by Eric Gazoni to read and write Excel's xlsx format.

Steven Rumbalski
After talking to one of my colleagues, I'll definitely be doing this in excel and will likely use a lot of your answer. That said, for future reference, can you inform me what I *can* use to access .xlsx files? If there's nothing in python, I wouldn't mind getting some practice in other languages. Unfortunately, it seems like excel-2007 is the standard here (along with the .xlsx format) so it seems like I'll be working with it more than I'd like.
jlv
If your files have fewer than 65536 rows and you have Excel, the easiest workaround is to save the file as "Excel 97-2003 Workbook (*.xls)". Then xlrd will work great. If it's longer than 65536 rows, you're still OK as long as it's a simple table (no formulas, data on just one tab). In that case you could save it as a CSV file and use Python's csv module. If neither of those works, you may need to parse the file directly: xlsx is a zipped XML file. Python has tools to deal with zipped files and with XML files, but you will have to search for documentation on Excel's XML format.
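
A rough sketch of that last option, using only the standard library. Assumptions to note: the sheet is stored at `xl/worksheets/sheet1.xml` (typical, but not guaranteed), text cells point into `xl/sharedStrings.xml`, and the helper name `read_xlsx_rows` is made up for this example. It returns raw strings and does not handle dates, formulas, or inline strings.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the main spreadsheet XML parts.
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def read_xlsx_rows(path, sheet="xl/worksheets/sheet1.xml"):
    """Return the rows of one worksheet as lists of raw cell strings."""
    with zipfile.ZipFile(path) as zf:
        shared = []
        # Text cells usually store an index into this shared string table.
        if "xl/sharedStrings.xml" in zf.namelist():
            root = ET.fromstring(zf.read("xl/sharedStrings.xml"))
            for si in root.iter(NS + "si"):
                shared.append("".join(t.text or "" for t in si.iter(NS + "t")))
        rows = []
        sheet_root = ET.fromstring(zf.read(sheet))
        for row in sheet_root.iter(NS + "row"):
            values = []
            for cell in row.iter(NS + "c"):
                v = cell.find(NS + "v")
                text = v.text if v is not None else None
                # t="s" marks a shared-string cell; its value is an index.
                if cell.get("t") == "s" and text is not None:
                    text = shared[int(text)]
                values.append(text)
            rows.append(values)
        return rows
```

Keep in mind that empty cells are normally left out of the XML entirely, so values can shift left within a row; a fuller version would use each cell's `r` attribute (e.g. `B7`) to place it in the right column.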
Steven Rumbalski
@jlv Since you will "likely use a lot" of my answer, could you accept it? (Not sure how to do that as I have not asked a question here.)
Steven Rumbalski
More info on xlrd and xlsx: On July 15, 2010 John Machin (author of xlrd) wrote "Support for reading basic data (open_workbook(..., formatting_info=False)) from Excel 2007 .xlsx and .xlsm files is in alpha test at the moment."
Steven Rumbalski
Wow, incredibly informative! I was intending on naming you the answerer when I got back but now I wish I could hand you an award for "informative posts" or something of the like. I should have realized that I could just convert it to a format that I *can* work in, but thanks for the reminder =)
jlv
+1  A: 

I only heard about this project this morning, so I've not had an opportunity to look at it, and have no idea what it's like; but take a look at Eric Gazoni's openpyxl project. The code can be found on bitbucket. The driving force behind this was the ability to read/write xlsx files from Python.

Mark Baker
Will be looking into that for future reference. Thanks!
jlv
As a port of my PHP code, I'll be keeping an eye on it as well :)
Mark Baker
Wow, that's a new one. Looks really good.
Steven Rumbalski
A: 

Adding to Steven Rumbalski's answer:

You might want to have your lookup value in a column other than the leftmost one. In those cases the INDEX and MATCH functions come in handy. See: http://www.mrexcel.com/articles/excel-vlookup-index-match.php
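
To show the idea outside Excel, here is a small Python analogy (not Excel code): MATCH finds the position of a value in one column, and INDEX returns the entry at that position from another column, so the lookup column can sit anywhere. The helper `index_match` and the sample columns are made up for illustration.

```python
def index_match(lookup_column, return_column, value):
    """Return the entry of return_column in the row where lookup_column equals value."""
    try:
        # Like MATCH(value, lookup_column, 0), except Python positions start at 0.
        position = lookup_column.index(value)
    except ValueError:
        return None  # no exact match (Excel would show #N/A)
    # Like INDEX(return_column, position).
    return return_column[position]


# Hypothetical data: the lookup value lives in column B, the result in column A.
cities = ["New York", "London", "Paris"]           # column A
coords = ["40.7,-74.0", "51.5,-0.1", "48.9,2.4"]   # column B
city = index_match(coords, cities, "51.5,-0.1")
```

Here `city` ends up as "London", even though the coordinates are to the right of the city names, which is the case VLOOKUP cannot handle directly.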

Wilgert Velinga