heya,
This is a follow-up question to this one:
http://stackoverflow.com/questions/2901422/python-dictreader-skipping-rows-with-missing-columns
Turns out I was being silly, and using the wrong ID field.
I'm using Python 3.x here, btw.
I have a dict of employees, indexed by a string, "directory_id". Each value is a nested dict with employee attributes (phone number, surname etc.). One of these values is a secondary ID, say "internal_id", and another is their manager, call it "manager_internal_id". The "internal_id" field is non-mandatory, and not every employee has one.
{'6443410501': {'manager_internal_id': '989634', 'givenName': 'Mary', 'phoneNumber': '+65 3434 3434', 'sn': 'Jones', 'internal_id': '434214'},
 '8117062158': {'manager_internal_id': '180682', 'givenName': 'John', 'phoneNumber': '+65 3434 3434', 'sn': 'Ashmore', 'internal_id': ''},
 '9227629067': {'manager_internal_id': '347394', 'givenName': 'Wright', 'phoneNumber': '+65 3434 3434', 'sn': 'Earl', 'internal_id': '257839'},
 '1724696976': {'manager_internal_id': '907239', 'givenName': 'Jane', 'phoneNumber': '+65 3434 3434', 'sn': 'Bronte', 'internal_id': '629067'},
}
(I've simplified the fields a little, both to make it easier to read and for privacy/compliance reasons.)
The issue here is that we index (key) each employee by their directory_id, but when we look up their manager, we need to find the manager by their "internal_id".
Before, when our dict used internal_id as the key, employees.keys() gave me the internal_ids, and I did a membership check against that. Now the last part of my if statement won't work, since the internal_ids are part of the dict values instead of being the keys themselves.
def lookup_supervisor(manager_internal_id, employees):
    if manager_internal_id is not None and manager_internal_id != "" and manager_internal_id in employees.keys():
        return (employees[manager_internal_id]['mail'], employees[manager_internal_id]['givenName'], employees[manager_internal_id]['sn'])
    else:
        return ('Supervisor Not Found', 'Supervisor Not Found', 'Supervisor Not Found')
So the first question is, how do I fix the if statement to check whether the manager_internal_id is present in the dict's list of internal_ids?
I've tried substituting employees.keys() with employees.values(), but that didn't work. I'm also hoping for something a little more efficient; I'm not sure whether there's a way to get a subset of the values, specifically all the entries for employees[directory_id]['internal_id'].
Hopefully there's some Pythonic way of doing this, without using a massive heap of nested for/if loops.
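To make the problem concrete, here is a minimal sketch of why the values() check fails and of one possible alternative: building a second dict keyed by internal_id. The helper name build_internal_id_index is made up for this example, and the sample data is just two of the rows above; I'm not claiming this is the best approach.

```python
# Two rows in the same shape as the dict above (values are nested dicts).
employees = {
    '6443410501': {'manager_internal_id': '989634', 'givenName': 'Mary',
                   'sn': 'Jones', 'internal_id': '434214'},
    '8117062158': {'manager_internal_id': '180682', 'givenName': 'John',
                   'sn': 'Ashmore', 'internal_id': ''},
}

# Each value is a whole employee dict, so a bare ID string never equals one.
found_in_values = '434214' in employees.values()


def build_internal_id_index(employees):
    """Hypothetical helper: map internal_id -> employee dict, skipping blanks."""
    return {
        data['internal_id']: data
        for data in employees.values()
        if data.get('internal_id')
    }


by_internal_id = build_internal_id_index(employees)

# Membership is now an ordinary dict key check.
found_in_index = '434214' in by_internal_id
```

The index is built in one pass over the values, so each later lookup avoids scanning every employee again.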
My second question is: how do I then cleanly return the required employee attributes (mail, givenName, sn etc.)? My for loop iterates over each employee and calls lookup_supervisor. I'm feeling a bit stupid/stumped here.
def tidy_data(employees):
    for directory_id, data in employees.items():
        # We really shouldn't be passing employees back and forth like this - hmm, classes?
        data['SupervisorEmail'], data['SupervisorFirstName'], data['SupervisorSurname'] = lookup_supervisor(data['manager_internal_id'], employees)
Should I redesign my data-structure? Or is there another way?
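One direction that might work without a redesign, sketched under some assumptions: build the internal_id lookup once inside tidy_data, and have lookup_supervisor take that lookup instead of the full employees dict. The sample rows, the 'd1'/'d2' keys and the example.com addresses are made up, and using .get('mail', '') rather than ['mail'] is my own choice so that rows without a mail field don't raise.

```python
NOT_FOUND = ('Supervisor Not Found',) * 3


def lookup_supervisor(manager_internal_id, by_internal_id):
    """Return (mail, givenName, sn) for the manager, or the fallback tuple."""
    manager = by_internal_id.get(manager_internal_id) if manager_internal_id else None
    if manager is None:
        return NOT_FOUND
    return (manager.get('mail', ''), manager['givenName'], manager['sn'])


def tidy_data(employees):
    # Build the internal_id lookup once, rather than scanning per employee.
    by_internal_id = {
        data['internal_id']: data
        for data in employees.values()
        if data.get('internal_id')
    }
    for data in employees.values():
        (data['SupervisorEmail'],
         data['SupervisorFirstName'],
         data['SupervisorSurname']) = lookup_supervisor(
            data.get('manager_internal_id'), by_internal_id)


# Made-up sample: John reports to Mary; Mary's manager is not in the dict.
employees = {
    'd1': {'internal_id': '434214', 'manager_internal_id': '989634',
           'mail': 'mary@example.com', 'givenName': 'Mary', 'sn': 'Jones'},
    'd2': {'internal_id': '', 'manager_internal_id': '434214',
           'mail': 'john@example.com', 'givenName': 'John', 'sn': 'Ashmore'},
}
tidy_data(employees)
```

After the call, John's row carries Mary's details in the three Supervisor fields, and Mary's row gets the 'Supervisor Not Found' fallback.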
Thanks in advance =), Victor
EDIT: I've tweaked the code slightly, see below:
import csv
import re
import sys


class Employees:
    def import_gd_dump(self, input_file="test.csv"):
        gd_extract = csv.DictReader(open(input_file), dialect='excel')
        # Key each row (a dict of employee attributes) by its directory_id.
        self.employees = {row['directory_id']: row for row in gd_extract}

    def write_gd_formatted(self, output_file="gd_formatted.csv"):
        gd_output_fieldnames = ('internal_id', 'mail', 'givenName', 'sn', 'dbcostcenter', 'directory_id', 'manager_internal_id', 'PHFull', 'PHFull_message', 'SupervisorEmail', 'SupervisorFirstName', 'SupervisorSurname')
        try:
            gd_formatted = csv.DictWriter(open(output_file, 'w', newline=''), fieldnames=gd_output_fieldnames, extrasaction='ignore', dialect='excel')
        except IOError:
            print('Unable to open file, IO error (Is it locked?)')
            sys.exit(1)
        headers = {n: n for n in gd_output_fieldnames}
        gd_formatted.writerow(headers)
        for directory_id, data in self.employees.items():
            gd_formatted.writerow(data)

    def tidy_data(self):
        for directory_id, data in self.employees.items():
            data['PHFull'], data['PHFull_message'] = self.clean_phone_number(data['telephoneNumber'])
            data['SupervisorEmail'], data['SupervisorFirstName'], data['SupervisorSurname'] = self.lookup_supervisor(data['manager_internal_id'])

    def clean_phone_number(self, original_telephone_number):
        standard_format = re.compile(r'^\+(?P<intl_prefix>\d{2})\((?P<area_code>\d)\)(?P<local_first_half>\d{4})-(?P<local_second_half>\d{4})')
        extra_zero = re.compile(r'^\+(?P<intl_prefix>\d{2})\(0(?P<area_code>\d)\)(?P<local_first_half>\d{4})-(?P<local_second_half>\d{4})')
        missing_hyphen = re.compile(r'^\+(?P<intl_prefix>\d{2})\(0(?P<area_code>\d)\)(?P<local_first_half>\d{4})(?P<local_second_half>\d{4})')
        if standard_format.search(original_telephone_number):
            result = standard_format.search(original_telephone_number)
            return '0' + result.group('area_code') + result.group('local_first_half') + result.group('local_second_half'), ''
        elif extra_zero.search(original_telephone_number):
            result = extra_zero.search(original_telephone_number)
            return '0' + result.group('area_code') + result.group('local_first_half') + result.group('local_second_half'), 'Extra zero in area code - ask user to remediate. '
        elif missing_hyphen.search(original_telephone_number):
            result = missing_hyphen.search(original_telephone_number)
            return '0' + result.group('area_code') + result.group('local_first_half') + result.group('local_second_half'), 'Missing hyphen in local component - ask user to remediate. '
        else:
            return '', "Number didn't match format. Original text is: " + original_telephone_number

    def lookup_supervisor(self, manager_internal_id):
        if manager_internal_id is not None and manager_internal_id != "":# and manager_internal_id in self.employees.values():
            # This is the part that breaks: self.employees is keyed by directory_id, not internal_id.
            return (self.employees[manager_internal_id]['mail'], self.employees[manager_internal_id]['givenName'], self.employees[manager_internal_id]['sn'])
        else:
            return ('Supervisor Not Found', 'Supervisor Not Found', 'Supervisor Not Found')


if __name__ == '__main__':
    our_employees = Employees()
    our_employees.import_gd_dump('test.csv')
    our_employees.tidy_data()
    our_employees.write_gd_formatted()
I guess (1) I'm looking for a better way to structure/store Employee/Employees, and (2) I'm having issues with lookup_supervisor() in particular.
Should I be creating an Employee Class, and nesting these inside Employees?
And should I even be doing what I'm doing in tidy_data(), calling clean_phone_number() and lookup_supervisor() in a for loop over the dict's items? Urgh, confused.
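For the Employee-class idea, this is roughly the shape I have in mind, as a sketch only. It assumes Python 3.7+ for dataclasses; the attribute names, the add() and supervisor_of() methods, and the two indexes are all hypothetical, and John's manager_internal_id is changed to Mary's internal_id purely so the lookup has something to find.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Employee:
    directory_id: str
    given_name: str
    surname: str
    mail: str = ''
    internal_id: str = ''          # optional: not every employee has one
    manager_internal_id: str = ''


class Employees:
    def __init__(self) -> None:
        self.by_directory_id: Dict[str, Employee] = {}
        self.by_internal_id: Dict[str, Employee] = {}

    def add(self, employee: Employee) -> None:
        """Register the employee under both IDs (internal_id only if set)."""
        self.by_directory_id[employee.directory_id] = employee
        if employee.internal_id:
            self.by_internal_id[employee.internal_id] = employee

    def supervisor_of(self, employee: Employee) -> Optional[Employee]:
        """Return the manager's Employee, or None if it is not on record."""
        return self.by_internal_id.get(employee.manager_internal_id)


staff = Employees()
mary = Employee('6443410501', 'Mary', 'Jones',
                internal_id='434214', manager_internal_id='989634')
john = Employee('8117062158', 'John', 'Ashmore',
                manager_internal_id='434214')  # altered for illustration
staff.add(mary)
staff.add(john)

johns_boss = staff.supervisor_of(john)
marys_boss = staff.supervisor_of(mary)
```

Keeping both indexes inside Employees means callers never pass the raw dict around, which is the part of tidy_data() that bothered me.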