views:

187

answers:

2

heya,

We have an Excel file with a worksheet containing people records.

1. Phone Number Sanitation

One of the fields is a phone number field, which contains phone numbers in the format e.g.:

+XX(Y)ZZZZ-ZZZZ

(where X, Y and Z are integers).

There are also some records which have less digits, e.g.:

+XX(Y)ZZZ-ZZZZ

And others with really screwed up formats:

+XX(Y)ZZZZ-ZZZZ / ZZZZ

or:

ZZZZZZZZ

We need to sanitise these all into the format:

0YZZZZZZZZ (or OYZZZZZZ with those with less digits).

2. Fill in Supervisor Details

Each person also has a supervisor, given as an numeric ID. We need to do a lookup to get the name and email address of that supervisor, and add it to the line.

This lookup will be firstly on the same worksheet (i.e. searching itself), and it can then fallback to another workbook with more people.

3. Approach?

For the first issue, I was thinking of using regex in Excel/VBA somehow, to do the parsing. My Excel-fu isn't the best, but I suppose I can learn...lol. Any particular points on this one?

However, would I be better off exporting the XLS to a CSV (e.g. using xlrd), then using Python to fix up the phone numbers?

For the second approach, I was thinking of just using vlookups in Excel, to pull in the data, and somehow, having it fall through, first on searching itself, then on the external workbook, then just putting in error text. Not sure how to do that last part.

However, if I do happen to choose to export to CSV and do it in Python, what's an efficient way of doing the vlookup? (Should I convert to a dict, or just iterate? Or is there a better, or more idiomatic way?)

Cheers, Victor

A: 

If you go the VBA route, it may pay to take a look at Tushar Mehta's documentation. If you go the Python route, you could try parsing to CSV or, alternatively, just manipulating things in memory and writing via XLWT (which would be my preferred technique). You may also consider just modifying the Excel data directly using COM calls, based on something like this. Finally, if you're committed to doing this outside of Excel, you might take a look at Jython and Apache POI. Not the most lightweight solution, but POI is the most feature-complete library I know of that does not depend on running on Windows.

As other's have observed in comments, it's hard to be concrete with such a broad question. Hopefully something here gets you started...

ig0774
I already had a look at Tushar Mehta's documentation (http://www.tmehta.com/regexp/), before which is really nicely written.However, I got stuck at how to actually extract out my subgroups - I can use RegExpFind to find a certain single regex expression, however, I'm not really sure how to find multiple expression groups, and join them all together, or even how to specify which part of the found expression I actually want to use.E.g. would anybody know the VBA to convert the above phone number +61(2)4534-5453 into the above result 0245325453.
victorhooi
If we do go down the Python route, I'm probably going to use xlrd. So I'd read this into a dictionary, and then do a search on that? Or is there a faster way to do the equavilent of a vlookup in xlrd - as in, search for a particular row, then pull out certain columns from that row (I know that you can write Excel formulas using xlwt, however, I'd rather a more Pythonic way if there is one?)
victorhooi
+1  A: 

In general, avoid Excel formulas; use xlrd to extract the data that you need, then forget it came from Excel and manipulate the data using Python. E.g. addressing the xlrd / vlookup question: the best way would be to create a dictionary ONCE from the relevant parts of the 2 columns containing the keys and values.

Using xlrd to export to CSV and then reading it back is a waste of time AND loses valuable information (like what is the actual type of the data in the Excel cell). If your data was in a database would you export it to CSV and read it back??

John Machin