views:

642

answers:

2

I have an object.

    fp = open(self.currentEmailPath, "rb")
    p = email.Parser.Parser()
    self._currentEmailParsedInstance= p.parse(fp)
    fp.close()

self.currentEmailParsedInstance, from this object I want to get the body of an email, text only no html....

How do I do it?


something like this?

        newmsg=self._currentEmailParsedInstance.get_payload()
        body=newmsg[0].get_content....?

then strip the html from body. just what is that .... method to return the actual text... maybe I mis-understand you

        msg=self._currentEmailParsedInstance.get_payload()
        print type(msg)

output = type 'list'


the email

Return-Path: [email protected]
Received: from xx.xx.net (xxxx) by mxx3.xx.net (xxx)
id 485EF65F08EDX5E12 for [email protected]; Thu, 23 Oct 2008 06:07:51 +0200
Received: from xxxxx2 (ccc) by fxxxx.net (ccc) (authenticated as [email protected]) id 48798D4001146189 for [email protected]; Thu, 23 Oct 2008 06:07:51 +0200
From: "xxx" [email protected]
To: [email protected]
Subject: FW: xxx
Date: Thu, 23 Oct 2008 12:07:45 +0800
Organization: xx
Message-ID: <001601c934c4$xxxx30$a9ff460a@xxx>
MIME-Version: 1.0
Content-Type: multipart/mixed;
boundary="----=_NextPart_000_0017_01C93507.F6F64E30"
X-Mailer: Microsoft Office Outlook 11
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3138
Thread-Index: Ack0wLaumqgZo1oXSBuIpUCEg/wfOAABAFEA

This is a multi-part message in MIME format.

------=_NextPart_000_0017_01C93507.F6F64E30
Content-Type: multipart/alternative;
boundary="----=_NextPart_001_0018_01C93507.F6F64E30"

------=_NextPart_001_0018_01C93507.F6F64E30
Content-Type: text/plain;
charset="us-ascii"
Content-Transfer-Encoding: 7bit

From: xxxx.xxxx [mailto:[email protected]]
Sent: Thursday, October 23, 2008 11:37 AM
To: [email protected]
Subject: S/I for xxxxx (B/L
No.:4357-0120-810.044)

Pls find attached the xxxx.doc),

Thanks.

B.rgds,

xxx xxx

------=_NextPart_001_0018_01C93507.F6F64E30
Content-Type: text/html;
charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

xmlns:o=3D"urn:schemas-microsoft-com:office:office" =
xmlns:w=3D"urn:schemas-microsoft-com:office:word" =
xmlns:st1=3D"urn:schemas-microsoft-com:office:smarttags" =
xmlns=3D"http://www.w3.org/TR/REC-html40">

HTML STUFF till

------=_NextPart_001_0018_01C93507.F6F64E30--

------=_NextPart_000_0017_01C93507.F6F64E30
Content-Type: application/msword;
name="xxxx.doc"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename="xxxx.doc"

0M8R4KGxGuEAAAAAAAAAAAAAAAAAAAAAPgADAP7/CQAGAAAAAAAAAAAAAAABAAAAYAAAAAAAAAAA EAAAYgAAAAEAAAD+////AAAAAF8AAAD///////////////////////////////////////////// //////////////////////////////////////////////////////////////////////////// //////////////////////////////////////////////////////////////////////////// //////////////////////////////////////////////////////////////////////////// //////////////////////////////////////////////////////////////////////////// //////////////////////////////////////////////////////////////////////////// //////////////////////////////////////////////////////////////////////////// ///////////////////////////////////////////////////////////////////////////s pcEAI2AJBAAA+FK/AAAAAAAAEAAAAAAABgAAnEIAAA4AYmpiaqEVoRUAAAAAAAAAAAAAAAAAAAAA AAAECBYAMlAAAMN/AADDfwAAQQ4AAAAAAAAPAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAD//w8AAAAA AAAAAAD//w8AAAAAAAAAAAD//w8AAAAAAAAAAAAAAAAAAAAAAKQAAAAAAEYEAAAAAAAARgQAAEYE AAAAAAAARgQAAAAAAABGBAAAAAAAAEYEAAAAAAAARgQAABQAAAAAAAAAAAAAAFoEAAAAAAAA4hsA AAAAAADiGwAAAAAAAOIbAAA4AAAAGhwAAHwAAACWHAAARAAAAFoEAAAAAAAABzcAAEgBAADmHAAA FgAAAPwcAAAAAAAA/BwAAAAAAAD8HAAAAAAAAPwcAAAAAAAA/BwAAAAAAAD8HAAAAAAAAPwcAAAA AAAAMjYAAAIAAAA0NgAAAAAAADQ2AAAAAAAANDYAAAAAAAA0NgAAAAAAADQ2AAAAAAAANDYAACQA AABPOAAAaAIAALc6AACOAAAAWDYAAGkAAAAAAAAAAAAAAAAAAAAAAAAARgQAAAAAAABHLAAAAAAA AAAAAAAAAAAAAAAAAAAAAAD8HAAAAAAAAPwcAAAAAAAARywAAAAAAABHLAAAAAAAAFg2AAAAAAAA

------=_NextPart_000_0017_01C93507.F6F64E30--


I just want to get :

From: xxxx.xxxx [mailto:[email protected]]
Sent: Thursday, October 23, 2008 11:37 AM
To: [email protected]
Subject: S/I for xxxxx (B/L
No.:4357-0120-810.044)

Pls find attached the xxxx.doc),

Thanks.

B.rgds,

xxx xxx


not sure if the mail is malformed! seems if you get an html page you have to do this:

        parts=self._currentEmailParsedInstance.get_payload()
        print parts[0].get_content_type()
        ..._multipart/alternative_
        textParts=parts[0].get_payload()
        print textParts[0].get_content_type()
        ..._text/plain_
        body=textParts[0].get_payload()
        print body
        ...get the text without a problem!!

thank you so much Vinko.

So its kinda like dealing with xml, recursive in nature.

+3  A: 

This will get you the contents of the message

self.currentEmailParsedInstance.get_payload()

As for the text only part you will have to strip HTML on your own, for example using BeautifulSoup.

Check this link for more information about the Message class the Parser returns. If you mean getting the text part of messages containing both HTML and plain text version of themselves, you can specify an index to get_payload() to get the part you want.

I tried with a different MIME email because what you pasted seems malformed, hopefully it got malformed when you edited it.

>>> parser = email.parser.Parser()
>>> message = parser.parse(open('/home/vinko/jlm.txt','r'))
>>> message.is_multipart()
True
>>> parts = message.get_payload()
>>> len(parts)
2
>>> parts[0].get_content_type()
'text/plain'
>>> parts[1].get_content_type()
'message/rfc822'
>>> parts[0].get_payload()
'Message Text'

parts will contain all parts of the multipart message, you can check their content types as shown and get only the text/plain ones, for instance.

Good luck.

Vinko Vrsalovic
hi, just want the text, I want to throw away the html.pls look at the question again
Setori
Please rewrite your question. How do those emails look like? What is the problem with what you get with get_payload()? Why do you use in one example self.currentEmailParsedInstance and in the other self._email?
Vinko Vrsalovic
A: 

ended up with this

        parser = email.parser.Parser()
        self._email = parser.parse(open('/home/vinko/jlm.txt','r'))
        parts=self._email.get_payload()
        check=parts[0].get_content_type()
        if check == "text/plain":
            return parts[0].get_payload()
        elif check == "multipart/alternative":
            part=parts[0].get_payload()
            if part[0].get_content_type() == "text/plain":
                return part[0].get_payload()
            else:
                return "cannot obtain the body of the email"
        else:
            return "cannot obtain the body of the email"
Setori