tags:

views:

108

answers:

2

Hi,

I have a small script we're using to read in a CSV file containing employees, and perform some basic manipulations on that data.

We read in the data (import_gd_dump), and create an Employees object, containing a list of Employee objects (maybe I should think of a better naming convention...lol). We then call clean_all_phone_numbers() on Employees, which calls clean_phone_number() on each Employee, as well as lookup_all_supervisors(), on Employees.

import csv
import re
import sys

#class CSVLoader:
#    """Virtual class to assist with loading in CSV files."""
#    def import_gd_dump(self, input_file='Gp Directory 20100331 original.csv'):
#        gd_extract = csv.DictReader(open(input_file), dialect='excel')
#        employees = []
#        for row in gd_extract:
#            curr_employee = Employee(row)
#            employees.append(curr_employee)
#        return employees
#    #self.employees = {row['dbdirid']:row for row in gd_extract}

# Previously, this was inside a (virtual) class called "CSVLoader".
# However, according to here (http://tomayko.com/writings/the-static-method-thing) - the idiomatic way of doing this in Python is not with a class-fucntion but with a module-level function
def import_gd_dump(input_file='Gp Directory 20100331 original.csv'):
    """Return a list ('employee') of dict objects, taken from a Group Directory CSV file."""
    gd_extract = csv.DictReader(open(input_file), dialect='excel')
    employees = []
    for row in gd_extract:
        employees.append(row)
    return employees

def write_gd_formatted(employees_dict, output_file="gd_formatted.csv"):
    """Read in an Employees() object, and write out each Employee() inside this to a CSV file"""
    gd_output_fieldnames = ('hrid', 'mail', 'givenName', 'sn', 'dbcostcenter', 'dbdirid', 'hrreportsto', 'PHFull', 'PHFull_message', 'SupervisorEmail', 'SupervisorFirstName', 'SupervisorSurname')
    try:
        gd_formatted = csv.DictWriter(open(output_file, 'w', newline=''), fieldnames=gd_output_fieldnames, extrasaction='ignore', dialect='excel')
    except IOError:
        print('Unable to open file, IO error (Is it locked?)')
        sys.exit(1)

    headers = {n:n for n in gd_output_fieldnames}
    gd_formatted.writerow(headers)
    for employee in employees_dict.employee_list:
        # We're using the employee object's inbuilt __dict__ attribute - hmm, is this good practice?
        gd_formatted.writerow(employee.__dict__)

class Employee:
    """An Employee in the system, with employee attributes (name, email, cost-centre etc.)"""
    def __init__(self, employee_attributes):
        """We use the Employee constructor to convert a dictionary into instance attributes."""
        for k, v in employee_attributes.items():
            setattr(self, k, v)

    def clean_phone_number(self):
        """Perform some rudimentary checks and corrections, to make sure numbers are in the right format.
        Numbers should be in the form 0XYYYYYYYY, where X is the area code, and Y is the local number."""
        if self.telephoneNumber is None or self.telephoneNumber == '':
            return '', 'Missing phone number.'
        else:
            standard_format = re.compile(r'^\+(?P<intl_prefix>\d{2})\((?P<area_code>\d)\)(?P<local_first_half>\d{4})-(?P<local_second_half>\d{4})')
            extra_zero = re.compile(r'^\+(?P<intl_prefix>\d{2})\(0(?P<area_code>\d)\)(?P<local_first_half>\d{4})-(?P<local_second_half>\d{4})')
            missing_hyphen = re.compile(r'^\+(?P<intl_prefix>\d{2})\(0(?P<area_code>\d)\)(?P<local_first_half>\d{4})(?P<local_second_half>\d{4})')
            if standard_format.search(self.telephoneNumber):
                result = standard_format.search(self.telephoneNumber)
                return '0' + result.group('area_code') + result.group('local_first_half') + result.group('local_second_half'), ''
            elif extra_zero.search(self.telephoneNumber):
                result = extra_zero.search(self.telephoneNumber)
                return '0' + result.group('area_code') + result.group('local_first_half') + result.group('local_second_half'), 'Extra zero in area code - ask user to remediate. '
            elif missing_hyphen.search(self.telephoneNumber):
                result = missing_hyphen.search(self.telephoneNumber)
                return '0' + result.group('area_code') + result.group('local_first_half') + result.group('local_second_half'), 'Missing hyphen in local component - ask user to remediate. '
            else:
                return '', "Number didn't match recognised format. Original text is: " + self.telephoneNumber

class Employees:
    def __init__(self, import_list):
        self.employee_list = []    
        for employee in import_list:
            self.employee_list.append(Employee(employee))

    def clean_all_phone_numbers(self):
        for employee in self.employee_list:
            #Should we just set this directly in Employee.clean_phone_number() instead?
            employee.PHFull, employee.PHFull_message = employee.clean_phone_number()

    # Hmm, the search is O(n^2) - there's probably a better way of doing this search?
    def lookup_all_supervisors(self):
        for employee in self.employee_list:
            if employee.hrreportsto is not None and employee.hrreportsto != '':
                for supervisor in self.employee_list:
                    if supervisor.hrid == employee.hrreportsto:
                        (employee.SupervisorEmail, employee.SupervisorFirstName, employee.SupervisorSurname) = supervisor.mail, supervisor.givenName, supervisor.sn
                        break
                else:
                    (employee.SupervisorEmail, employee.SupervisorFirstName, employee.SupervisorSurname) = ('Supervisor not found.', 'Supervisor not found.', 'Supervisor not found.')
            else:
                (employee.SupervisorEmail, employee.SupervisorFirstName, employee.SupervisorSurname) = ('Supervisor not set.', 'Supervisor not set.', 'Supervisor not set.')

    #Is thre a more pythonic way of doing this?
    def print_employees(self):
        for employee in self.employee_list:
            print(employee.__dict__)

if __name__ == '__main__':
    db_employees = Employees(import_gd_dump())
    db_employees.clean_all_phone_numbers()
    db_employees.lookup_all_supervisors()
    #db_employees.print_employees()
    write_gd_formatted(db_employees)

Firstly, my preamble question is, can you see anything inherently wrong with the above, from either a class design or Python point-of-view? Is the logic/design sound?

Anyhow, to the specifics:

  1. The Employees object has a method, clean_all_phone_numbers(), which calls clean_phone_number() on each Employee object inside it. Is this bad design? If so, why? Also, is the way I'm calling lookup_all_supervisors() bad?
  2. Originally, I wrapped the clean_phone_number() and lookup_supervisor() method in a single function, with a single for-loop inside it. clean_phone_number is O(n), I believe, lookup_supervisor is O(n^2) - is it ok splitting it into two loops like this?
  3. In clean_all_phone_numbers(), I'm looping on the Employee objects, and settings their values using return/assignment - should I be setting this inside clean_phone_number() itself?

There's also a few things that I'm sorted of hacked out, not sure if they're bad practice - e.g. print_employee() and gd_formatted() both use __dict__, and the constructor for Employee uses setattr() to convert a dictionary into instance attributes.

I'd value any thoughts at all. If you think the questions are too broad, let me know and I can repost as several split up (I just didn't want to pollute the boards with multiple similar questions, and the three questions are more or less fairly tightly related).

Cheers, Victor

+2  A: 

Looks fine to me. Good job. How often are you going to run this script? Most of your questions are moot if this is a one-off thing.

  1. I like the way Employees.cleen_all_phone_numbers() delegates to Employee.clean_phone_number()
  2. You really should be using an index (dictionary) here. You can index each employee by hrid when you create them in O(n) and then look them up in O(1).
    • But only do this if you ever have to run the script again...
    • Just get into the habit of using dictionaries. They are painless and make code easier to read. Whenever you write a method lookup_* you probably just want to index a dictionary.
  3. not sure. I like explicitly setting state, but this is actually bad design - clean_phone_number() should do that, Employees should be responsible for their own state.
Daren Thomas
Thanks for the quick response. It'll be run every week or so, the input file changes a bit. We can get deltas, but there's issues with those, and it's apparently easier just to re-write the whole file. Regarding point 2, what exactly did you mean here? Originally I was using dicts (see http://stackoverflow.com/questions/2901872/python-checking-for-membership-inside-nested-dict), however, I moved to a class-based design. Can you add a hashmap/dict to a class? Subclass dict? For point 3, so you're saying from a design POV, I shouldn't return anything, but should use clean_phone_number to set?
victorhooi
your `lookup_all_supervisors` uses a nested loop to find the supervisor for each employee. The nested loop should just be a lookup in a dictionary of supervisors that you can create when reading the employees (single pass) or at a later time (in a second pass). This will bring O(n^2) down to O(n) for assigning supervisors.
Daren Thomas
Ah, yes and regarding 3. exactly that: Your solution updates each employee with what that same employee thinks is a clean phone number. Instead, just tell the employee to clean the friggin phone number! Let the employee manage his own state - outside objects shouldn't be messing with other objects' state!
Daren Thomas
@Daren Thomas: The issue with using a dict is that hrid, the ID number that the supervisor links to, is actually optional. I modified the code to use a dict instead of a list, and it introduced a new bug, where every user with an empty hrid field actually overrode the previous one, so we ended up missing users. Do you know of a workaround?
victorhooi
@Daren Thomas: Ignore last comment, I'm just indexing users with missing ID numberse on their email, you won't be able to find those users on an ID lookup, but it means they'll be in the dict for when you iterate on it.
victorhooi
+1  A: 

you should close your files after reading them I suggest moving all compiled re's tot he top level (otherwise you compile them every call) if self.telephoneNumber is None or self.telephoneNumber == '': cen be easily rewrittent as if not self.telephoneNumber

Guard
@Guard: Thanks for the tips. Hmm, how would I close the files, as I don't have an actual object reference to the file, it's opened as part of the csv.DictReader line? I'll move the re.compile's to instance variables on Employee, is that optimal? Or should they be module level? And yeah, I'll change the None/== line. Thanks again.
victorhooi
change csv.DictReader(open(input_file), ...) to f = open(input_file)csv.DictReader(f, ...)close(f).class-level variables are the best, they are computed inly once when the class is constructed
Guard
@Guard: Thanks for your help. I've already voted for your answer, but I'd like to award you an answer, but it seems I can only tick one at once - you wouldn't happen to know a way around that?
victorhooi
you can pick only one answer, afaiknever mind, happy coding
Guard