views:

219

answers:

2

heya,

This is a followup questions to this one:

http://stackoverflow.com/questions/2901422/python-dictreader-skipping-rows-with-missing-columns

Turns out I was being silly, and using the wrong ID field.

I'm using Python 3.x here, btw.

I have a dict of employees, indexed by a string, "directory_id". Each value is a nested dict with employee attributes (phone number, surname etc.). One of these values is a secondary ID, say "internal_id", and another is their manager, call it "manager_internal_id". The "internal_id" field is non-mandatory, and not every employee has one.

{'6443410501': {'manager_internal_id': '989634', 'givenName': 'Mary', 'phoneNumber': '+65 3434 3434', 'sn': 'Jones', 'internal_id': '434214'}
'8117062158': {'manager_internal_id': '180682', 'givenName': 'John', 'phoneNumber': '+65 3434 3434', 'sn': 'Ashmore', 'internal_id': ''}
'9227629067': {'manager_internal_id': '347394', 'givenName': 'Wright', 'phoneNumber': '+65 3434 3434', 'sn': 'Earl', 'internal_id': '257839'}
'1724696976': {'manager_internal_id': '907239', 'givenName': 'Jane', 'phoneNumber': '+65 3434 3434', 'sn': 'Bronte', 'internal_id': '629067'}

}

(I've simplified the fields a little, both to make it easier to read, and also for privacy/compliance reasons).

The issue here is that we index (key) each employee by their directory_id, but when we lookup their manager, we need to find managers by their "internal_id".

Before, when our dict was using internal_id as the key, employee.keys() was a list of internal_ids, and I was using a membership check on this. Now, the last part of my if statement won't work, since the internal_ids is part of the dict values, instead of the key itself.

def lookup_supervisor(manager_internal_id, employees):
    if manager_internal_id is not None and manager_internal_id != "" and manager_internal_id in employees.keys():
        return (employees[manager_internal_id]['mail'], employees[manager_internal_id]['givenName'], employees[manager_internal_id]['sn'])
    else:
        return ('Supervisor Not Found', 'Supervisor Not Found', 'Supervisor Not Found')

So the first question is, how do I fix the if statement to check whether the manager_internal_id is present in the dict's list of internal_ids?

I've tried substituting employee.keys() with employee.values(), that didn't work. Also, I'm hoping for something a little more efficient, not sure if there's a way to get a subset of the values, specifically, all the entries for employees[directory_id]['internal_id'].

Hopefully there's some Pythonic way of doing this, without using a massive heap of nested for/if loops.

My second question is, how do I then cleanly return the required employee attributes (mail, givenname, surname etc.). My for loop is iterating over each employee, and calling lookup_supervisor. I'm feeling a bit stupid/stumped here.

def tidy_data(employees):
    for directory_id, data in employees.items():
        # We really shouldnt' be passing employees back and forth like this - hmm, classes?
        data['SupervisorEmail'], data['SupervisorFirstName'], data['SupervisorSurname'] = lookup_supervisor(data['manager_internal_id'], employees)

Should I redesign my data-structure? Or is there another way?

Thanks in advance =), Victor

EDIT: I've tweaked the code slightly, see below:

class Employees:

    def import_gd_dump(self, input_file="test.csv"):
        gd_extract = csv.DictReader(open(input_file), dialect='excel')
        self.employees = {row['directory_id']:row for row in gd_extract}

    def write_gd_formatted(self, output_file="gd_formatted.csv"):
        gd_output_fieldnames = ('internal_id', 'mail', 'givenName', 'sn', 'dbcostcenter', 'directory_id', 'manager_internal_id', 'PHFull', 'PHFull_message', 'SupervisorEmail', 'SupervisorFirstName', 'SupervisorSurname')
        try:
            gd_formatted = csv.DictWriter(open(output_file, 'w', newline=''), fieldnames=gd_output_fieldnames, extrasaction='ignore', dialect='excel')
        except IOError:
            print('Unable to open file, IO error (Is it locked?)')
            sys.exit(1)

        headers = {n:n for n in gd_output_fieldnames}
        gd_formatted.writerow(headers)
        for internal_id, data in self.employees.items():
            gd_formatted.writerow(data)

    def tidy_data(self):
        for directory_id, data in self.employees.items():
            data['PHFull'], data['PHFull_message'] = self.clean_phone_number(data['telephoneNumber'])
            data['SupervisorEmail'], data['SupervisorFirstName'], data['SupervisorSurname'] = self.lookup_supervisor(data['manager_internal_id'])

    def clean_phone_number(self, original_telephone_number):
        standard_format = re.compile(r'^\+(?P<intl_prefix>\d{2})\((?P<area_code>\d)\)(?P<local_first_half>\d{4})-(?P<local_second_half>\d{4})')
        extra_zero = re.compile(r'^\+(?P<intl_prefix>\d{2})\(0(?P<area_code>\d)\)(?P<local_first_half>\d{4})-(?P<local_second_half>\d{4})')
        missing_hyphen = re.compile(r'^\+(?P<intl_prefix>\d{2})\(0(?P<area_code>\d)\)(?P<local_first_half>\d{4})(?P<local_second_half>\d{4})')
        if standard_format.search(original_telephone_number):
            result = standard_format.search(original_telephone_number)
            return '0' + result.group('area_code') + result.group('local_first_half') + result.group('local_second_half'), ''
        elif extra_zero.search(original_telephone_number):
            result = extra_zero.search(original_telephone_number)
            return '0' + result.group('area_code') + result.group('local_first_half') + result.group('local_second_half'), 'Extra zero in area code - ask user to remediate. '
        elif missing_hyphen.search(original_telephone_number):
            result = missing_hyphen.search(original_telephone_number)
            return '0' + result.group('area_code') + result.group('local_first_half') + result.group('local_second_half'), 'Missing hyphen in local component - ask user to remediate. '
        else:
            return '', "Number didn't match format. Original text is: " + original_telephone_number    

    def lookup_supervisor(self, manager_internal_id):
        if manager_internal_id is not None and manager_internal_id != "":# and manager_internal_id in self.employees.values():
            return (employees[manager_internal_id]['mail'], employees[manager_internal_id]['givenName'], employees[manager_internal_id]['sn'])
        else:
            return ('Supervisor Not Found', 'Supervisor Not Found', 'Supervisor Not Found')

if __name__ == '__main__':
    our_employees = Employees()
    our_employees.import_gd_dump('test.csv')
    our_employees.tidy_data()
    our_employees.write_gd_formatted()

I guess (1). I'm looking for a better way to structure/store Employee/Employees, and (2) I'm having issues in particular with lookup_supervisor().\

Should I be creating an Employee Class, and nesting these inside Employees?

And should I even be doing what I'm doing with tidy_data(), and calling clean_phone_number() and lookup_supervisor() on a for loop on the dict's items? Urgh. confused.

+1  A: 

You probably will need to do some iteration to get the data. I assume you don't want an extra dict that can get out of date, so it won't be worth it trying to store everything keyed on internal ids.

Try this on for size:

def lookup_supervisor(manager_internal_id, employees):
    if manager_internal_id is not None and manager_internal_id != "":
        manager_dir_ids = [dir_id for dir_id in employees if employees[dir_id].get('internal_id') == manager_internal_id]
        assert(len(manager_dir_ids) <= 1)
        if len(manager_dir_ids) == 1:
            return manager_dir_ids[0]
    return None

def tidy_data(employees):
    for emp_data in employees.values():
        manager_dir_id = lookup_supervisor(emp_data.get('manager_internal_id'), employees)
        for (field, sup_key) in [('Email', 'mail'), ('FirstName', 'givenName'), ('Surname', 'sn')]:
            emp_data['Supervisor'+field] = (employees[manager_dir_id][sup_key] if manager_dir_id is not None else 'Supervisor Not Found')

And you're definitely right that a class is the answer for passing employees around. In fact, I'd recommend against storing the 'Supervisor' keys in the employee dict, and suggest instead getting the supervisor dict fresh whenever you need it, perhaps with a get_supervisor_data method.

Your new OO version all looks reasonable except for the changes I already mentioned and some tweaks to clean_phone_number.

def clean_phone_number(self, original_telephone_number):
    phone_re = re.compile(r'^\+(?P<intl_prefix>\d{2})\((?P<extra_zero>0?)(?P<area_code>\d)\)(?P<local_first_half>\d{4})(?P<hyph>-?)(?P<local_second_half>\d{4})')
    result = phone_re.search(original_telephone_number)
    if result is None:
        return '', "Number didn't match format. Original text is: " + original_telephone_number
    msg = ''
    if result.group('extra_zero'):
        msg += 'Extra zero in area code - ask user to remediate. '
    if result.group('hyph'):    # Note: can have both errors at once
        msg += 'Missing hyphen in local component - ask user to remediate. '
    return '0' + result.group('area_code') + result.group('local_first_half') + result.group('local_second_half'), msg

You could definitely make an individual object for each employee, but seeing how you're using the data and what you need from it, I'm guessing it wouldn't have that much payoff.

Mu Mind
@ Mu Mind: I've added more a complete sample of the code above, now in a class (although possibly badly designed) - I'm hoping there's a cleaner solution for it now? Thanks for your answer - also, I wasn't aware you could do assignments inside an if clause in Python (line 3 in your sample)?
victorhooi
Yeah, you definitely can do assignments in an if clause. They're even accessible from outside the if block, after it executes. Python is pretty loose on scoping.
Mu Mind
Interesting, shall have to try that =). Btw, I pasted some cleaned up code above, not sure if that affects your answer? Also, upvoted you. Thanks.
victorhooi
+1  A: 

My python skills are poor, so I am far too ignorant to write out what I have in mind in any kind of reasonable time. But I do know how to do OO decomposition.

Why does the Employees class to do all the work? There are several types of things that your monolithic Employees class does:

  • Read and write data from a file - aka serialization
  • Manage and access data from individual employees
  • Manage relationships between exmployees.

I suggest that you create a class to handle each task group listed.

Define an Employee class to keep track or employee data and handle field processing/tidying tasks.

Use the Employees class as a container for employee objects. It can handle tasks like tracking down an Employee's supervisor.

Define a virtual base class EmployeeLoader to define an interface (load, store, ?? ). Then implement a subclass for CSV file serialization. (The virtual base class is optional--I'm not sure how Python handles virtual classes, so this may not even make sense.)

So:

  • create an instance of EmployeeCSVLoader with a file name to work with.
  • The loader can then build an Employees object and parse the file.
  • As each record is read, a new Employee object will be created and stored in the Employees object.
  • Now ask the Employees object to populate supervisor links.
  • Iterate over the Employees object's collection of employees and ask each one to tidy itself.
  • Finally, let the serialization object handle updating the data file.

Why is this design worth the effort?

It makes things easier to understand. Smaller, task focused objects are easier to create clean, consistent APIs for.

If you find that you need an XML serialization format, it becomes trivial to add the new format. Subclass your virtual loader class to handle the XML parsing/generation. Now you can seamlessly move between CSV and XML formats.

In summary, use objects to simplify and structure your data. Section off common data and behaviors into separate classes. Keep each class tightly focused on a single type of ability. If your class is a collection, accessor, factory, kitchen sink, the API can never be usable: it will be too big and loaded with dissimilar groups of methods. But if your classes stay on topic, they will be easy to test, maintain, use, reuse, and extend.

daotoad
@daotoad: You made a good point, I'm splitting off Employee into their own separate class. However, I'm still not sure of the best way to read in their attributes from the CSV file. csv.DictReader returns a dictionary, with the column names as keys. Is there a way in python to create instance attributes for each Employee based on those column names? Secondly, I'm not sure about the EmployeeLoader/virtual class part, not sure how this works in Python - does anybody else know?
victorhooi
Not sure about the best way to approach class construction. Still working through tutorial. But you should be able to make a class method that takes a dict and uses it to set the attributes of your employee class. As for serialization, don't worry about a base class for now, and just think about the interface it would have. Then define `load` and `store` methods. You'll end up with code like `employees = EmployeeCVSLoader.load('data_file.csv');`. The load method will call the cvs loader to get each dict. Pass the dicts to the `new_from_dict` method.
daotoad
Sorry I don't have enough knowledge to give you good code. It's frustrating for me because I see exactly the design, but I don't have the skill level to render it yet. Do you want a Perl example?
daotoad
Well, yeah, a Perl example might be fine, and it would give me a good direction. I'm sure the other people here can make it more Pythonic =). Upvote you, btw.
victorhooi