I'm trying to scrape the information from Google Translate as a learning exercise and I can't figure out how to reach the content of this span tag.

<span title="Hello" onmouseover="this.style.backgroundColor='#ebeff9'"                                  
      onmouseout="this.style.backgroundColor='#fff'">
    Hallo
</span>

How would I use Python to get at the contents? Since the 'title' attribute of this span is dynamic, I guess I can target that as a point of entry?

For example, trying to translate: Hi, welcome to my house. Would you like a glass of tea or maybe some biscuits?

results in the following HTML output:

<span title="Hi, welcome to my house." 
onmouseover="this.style.backgroundColor='#ebeff9'" 
onmouseout="this.style.backgroundColor='#fff'">
    Hallo, mein Haus begrüßen zu dürfen. 
</span>
+4  A: 

Check out BeautifulSoup.

Vishal
Will do! Thanks! Wait, I just realized I never imported a different .py file. How would I 'import' this to my project?
Serg
You can use a Python package manager to install it and then import it in your program. http://en.wikipedia.org/wiki/EasyInstall
Vishal
BeautifulSoup comes with a setup script; just run 'python setup.py install'.
Cole
+1  A: 
# -*- coding: utf-8 -*-
def gettext(html):
    # Split on the closing tag; the text sits after the last ">" of the opening tag.
    for sp in html.split("</span>"):
        if "<span" in sp:
            return sp.rsplit(">", 1)[-1].strip()

myhtml="""
<span title="Hello" onmouseover="this.style.backgroundColor='#ebeff9'"
      onmouseout="this.style.backgroundColor='#fff'">
    Hallo
</span>
"""

print(gettext(myhtml))

myhtml="""
<span title="Hi, welcome to my house."
onmouseover="this.style.backgroundColor='#ebeff9'"
onmouseout="this.style.backgroundColor='#fff'">
    Hallo, mein Haus begrüßen zu dürfen.
</span>
"""

print(gettext(myhtml))

Output:

$ python mytranslate.py
Hallo
Hallo, mein Haus begrüßen zu dürfen.
ghostdog74
A: 

Python ships with a few XML and HTML parsers.

I would suggest that you look at the parsers that come with Python first, then look at third-party parsers if none of the included modules suits you.
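As a rough illustration of the standard-library route, here is a minimal sketch using `html.parser` from Python 3. The class name `SpanTextParser` and the helper `span_texts` are made up for this example, and it assumes the translated text always sits inside a `<span>` that has a `title` attribute, as in the question's markup. It matches on the presence of `title` rather than its value, since the value changes with every translation.

```python
from html.parser import HTMLParser


class SpanTextParser(HTMLParser):
    """Collect the text inside every <span> that carries a title attribute."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside a matching <span>
        self.chunks = []    # text pieces of the span currently being read
        self.results = []   # one entry per matching span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self.depth:
            # Nested span inside a matching one: track it so we close correctly.
            self.depth += 1
        elif "title" in dict(attrs):
            self.depth = 1
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.chunks).strip())

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def span_texts(html):
    """Return the stripped text of each titled <span> in the given HTML string."""
    parser = SpanTextParser()
    parser.feed(html)
    parser.close()
    return parser.results


sample = """
<span title="Hello" onmouseover="this.style.backgroundColor='#ebeff9'"
      onmouseout="this.style.backgroundColor='#fff'">
    Hallo
</span>
"""
print(span_texts(sample))
```

Unlike splitting the string on `</span>`, this keeps working when the page contains other tags around or inside the span, at the cost of a little more code.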

Swiss