Here's one approach.
I notice there is no delimiter between rows. In an effort to clean up the input data, I make a few assumptions:
- The first "row" is the "heading" of a "table", these will be our dictionary keys
- There are no empty fields in the first row (ie: no
""
)
- Any other field can be empty (ie:
""
)
- The first occurrence of two successive
"
indicates the end of the heading row
First I create a response based on your input string:
>>> response = '"Subject";"Start Date";"Start Time";"End Date";"End Time";"All day event";"Description""Play football";"16/11/2009";"10:00 PM";"16/11/2009";"11:00 PM";"false";"""Watch 2012";"20/11/2009";"07:00 PM";"";"08:00 PM";"false";"""";"17/11/2009";"9:00 AM";"17/11/2009";"10:00 AM";"false";""'
Note that
- the "End Date" for "Watch 2012" is empty
- there is a third event with an empty "Subject" heading
These two modifications illustrate some "edge cases" I'm concerned about.
First I will replace all occurrences of two consecutive "
with a pipe (|
) and strip out all other "
characters because I don't need them:
>>> response.replace('""', '|').replace('"', '')
'Subject;Start Date;Start Time;End Date;End Time;All day event;Description|Play football;16/11/2009;10:00 PM;16/11/2009;11:00 PM;false;|Watch 2012;20/11/2009;07:00 PM;|;08:00 PM;false;||;17/11/2009;9:00 AM;17/11/2009;10:00 AM;false;|'
If we had any empty cells not at the start or end of a row (ie: Watch 2012's End Date), it looks like this: ;|;
-- let's simply leave it blank:
>>> response.replace('""', '|').replace('"', '').replace(';|;', ';;')
'Subject;Start Date;Start Time;End Date;End Time;All day event;Description|Play football;16/11/2009;10:00 PM;16/11/2009;11:00 PM;false;|Watch 2012;20/11/2009;07:00 PM;;08:00 PM;false;||;17/11/2009;9:00 AM;17/11/2009;10:00 AM;false;|'
Now the |
indicates the split between the heading row and the next row. What happens if we split our string on |
?
>>> response.replace('""', '|').replace('"', '').replace(';|;', ';;').split('|')
['Subject;Start Date;Start Time;End Date;End Time;All day event;Description',
'Play football;16/11/2009;10:00 PM;16/11/2009;11:00 PM;false;',
'Watch 2012;20/11/2009;07:00 PM;;08:00 PM;false;',
'',
';17/11/2009;9:00 AM;17/11/2009;10:00 AM;false;',
'']
Looks like we're getting somewhere. There's a problem, though; there are two items in that list that are just the empty string ''
. They're there because we sometimes have a |
at the end of a row and the beginning of the next row, and splitting creates an empty element:
>>> "a|b||c".split('|')
['a', 'b', '', 'c']
Same goes for a lone delimited at the end of a line, too:
>>> "a||b|c|".split('|')
['a', '', 'b', 'c', '']
Let's filter our list to drop those empty "rows":
>>> rows = [row for row in response.replace('""', '|').replace('"', '').replace(';|;', ';;').split('|') if row]
>>> rows
['Subject;Start Date;Start Time;End Date;End Time;All day event;Description',
'Play football;16/11/2009;10:00 PM;16/11/2009;11:00 PM;false;',
'Watch 2012;20/11/2009;07:00 PM;;08:00 PM;false;',
';17/11/2009;9:00 AM;17/11/2009;10:00 AM;false;']
That's it for massaging the input; now we just need to build the dictionary. First, let's get the dictionary keys:
>>> dict_keys = rows[0].split(';')
>>> dict_keys
['Subject',
'Start Date',
'Start Time',
'End Date',
'End Time',
'All day event',
'Description']
And build a list of dictionaries, one for each event:
>>> import itertools
>>> events = []
>>> for row in rows[1:]:
... d = {}
... for k, v in itertools.izip(dict_keys, row.split(';')):
... d[k] = v
... events.append(d)
...
>>> events
[{'All day event': 'false',
'Description': '',
'End Date': '16/11/2009',
'End Time': '11:00 PM',
'Start Date': '16/11/2009',
'Start Time': '10:00 PM',
'Subject': 'Play football'},
{'All day event': 'false',
'Description': '',
'End Date': '',
'End Time': '08:00 PM',
'Start Date': '20/11/2009',
'Start Time': '07:00 PM',
'Subject': 'Watch 2012'},
{'All day event': 'false',
'Description': '',
'End Date': '17/11/2009',
'End Time': '10:00 AM',
'Start Date': '17/11/2009',
'Start Time': '9:00 AM',
'Subject': ''}]
Hope that helps!
Some notes:
- if you expect
|
to appear in your data, you might want to encode it first; or use a different delimiter
- supporting quotes in the data might be tricky (ie: 'Subject': 'Watching "2012"')
- I leave conversion of 'All day event' values from string to boolean as an exercise to the reader :D