What would be the best way to parse Gmail chat logs from the webpage where they're displayed? As far as I know, this is still the only way to access server-hosted Gmail chat logs (through either desktop Gmail or mobile Gmail).

When looking at the generated source where the conversation takes place, the markup looks like nested divs and spans (and the divs elsewhere on the page have randomized two-character ids and classes with no pattern). Here's an excerpt from a line that has a timestamp to the left:

<div>
<span style="display:block;float:left;color:#888">
2:56 PM&nbsp;
</span>

<span style="display:block;padding-left:6em">
<span>

<span style="font-weight:bold">me</span>: i'm trying to think of a good way to parse gmail chat logs

</span>
</span>
</div>

But not every line has a timestamp. Lines without one seem to have nonbreaking spaces where the timestamp would be:

<div>
<span style="display:block;float:left;color:#888">
&nbsp;&nbsp;
</span>

<span style="display:block;padding-left:6em">

<span>
and reformat that into something like an xml format
</span>

</span>
</div>

Should I use XPath? Is there something more efficient?
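
For comparison with an XPath approach, here is a minimal sketch that uses only Python's standard-library `html.parser`. It assumes every chat line is a `div` whose first top-level `span` holds the timestamp (or only nonbreaking spaces) and whose second top-level `span` holds the message, as in the two excerpts above. The names `ChatLineParser`, `parse_chat_html` and `clean` are made up for illustration, and the real page may well contain markup this does not handle.

```python
from html.parser import HTMLParser


def clean(text):
    """Turn nonbreaking spaces into plain spaces and collapse whitespace."""
    return " ".join(text.replace("\xa0", " ").split())


class ChatLineParser(HTMLParser):
    """Collects (timestamp, message) pairs, one per chat-line div.

    Assumption: a chat line is a div with two top-level spans, the
    first for the timestamp and the second for the message text.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)  # &nbsp; arrives as "\xa0"
        self.rows = []
        self._reset_row()

    def _reset_row(self):
        self._span_depth = 0    # nesting depth inside the current div's spans
        self._column = -1       # index of the current top-level span
        self._parts = [[], []]  # text pieces for timestamp and message

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self._reset_row()
        elif tag == "span":
            if self._span_depth == 0:
                self._column += 1
            self._span_depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self._span_depth > 0:
            self._span_depth -= 1
        elif tag == "div":
            # Only divs that contained at least two top-level spans count.
            if self._column >= 1:
                timestamp, message = (clean("".join(p)) for p in self._parts)
                self.rows.append((timestamp, message))
            self._reset_row()

    def handle_data(self, data):
        if self._span_depth > 0 and 0 <= self._column <= 1:
            self._parts[self._column].append(data)


def parse_chat_html(html):
    """Return a list of (timestamp, message); timestamp is "" when absent."""
    parser = ChatLineParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Lines without a timestamp come back with an empty string in the first position, so a later step can decide how to fill them in.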

Edit:

As data only, this is what it looks like:

12:43 AM John: Something something something.
         Something something something.
         me: Something something something?
12:44 AM Also, something something something.
12:47 AM Something something something.
12:48 AM Something something something
         with something something something.
12:49 AM John: Something.
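
Since the timestamp and the speaker are only printed when they change, turning this into XML means carrying the last seen values forward. Below is a rough sketch of that step, assuming the rows have already been extracted as `(timestamp, text)` pairs with an empty timestamp where the page shows none. The element and attribute names (`chat`, `message`, `time`, `speaker`) and the function `rows_to_xml` are placeholders, each row is treated as its own message, and the speaker regex is a guess that would misread a message that itself starts with something like `note: ...`. Checking the bold `span` in the HTML would be more reliable.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a speaker prefix looks like "Name: " at the start of the text.
SPEAKER_RE = re.compile(r"^([A-Za-z][\w .'-]{0,40}): (.*)$", re.DOTALL)


def rows_to_xml(rows):
    """Build a <chat> element from (timestamp, text) rows.

    Rows with an empty timestamp or no speaker prefix inherit the
    most recent timestamp and speaker seen so far.
    """
    root = ET.Element("chat")
    last_time = ""
    last_speaker = ""
    for timestamp, text in rows:
        if timestamp:
            last_time = timestamp
        match = SPEAKER_RE.match(text)
        if match:
            last_speaker, text = match.groups()
        message = ET.SubElement(
            root, "message", time=last_time, speaker=last_speaker
        )
        message.text = text
    return root


if __name__ == "__main__":
    rows = [
        ("12:43 AM", "John: Something something something."),
        ("", "Something something something."),
        ("", "me: Something something something?"),
        ("12:44 AM", "Also, something something something."),
    ]
    print(ET.tostring(rows_to_xml(rows), encoding="unicode"))
```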