ansaurus

Question

How can I translate this XPath expression to BeautifulSoup?

Answer 1

+4 A:

One option is to use lxml, which supports XPath out of the box. (I'm not familiar with BeautifulSoup, so I can't say how to do it there.)
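As a rough illustration of the XPath route, here is a minimal sketch using the standard library's xml.etree.ElementTree, which understands a limited XPath subset. This is a stand-in, not lxml itself, and it only works on well-formed markup. The sample markup and the path expression are assumptions made for illustration; the XPath from the question is not reproduced in this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not the real page).
SAMPLE = """
<table>
  <tr>
    <td class="altRow">1</td>
    <td class="altRow"><a href="/cabel">Abel, Christian</a></td>
  </tr>
  <tr>
    <td class="altRow">2</td>
    <td class="altRow"><a href="/jacevedo">Acevedo, Linda Jeannine</a></td>
  </tr>
</table>
"""

root = ET.fromstring(SAMPLE)

# Select every <td class="altRow"> below the root, then its direct <a> children.
links = [(a.text, a.get("href"))
         for a in root.findall(".//td[@class='altRow']/a")]

for text, href in links:
    print(text, href)
```

Real-world HTML is rarely well-formed XML, which is why a tolerant parser such as lxml's HTML parser or BeautifulSoup is usually needed in practice.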

Edit:
Try this (tested):

soup.findAll('td', 'altRow')[1].findAll('a', href=re.compile(r'/.a\w+'), recursive=False)

I used the docs at http://www.crummy.com/software/BeautifulSoup/documentation.html

Here, soup should be a BeautifulSoup object:

import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(html_string)

cobbal 2009-11-29 05:41:58

I'd rather avoid the Windows installation (http://codespeak.net/lxml/installation.html) if I can. Otherwise, lxml looks much nicer than BeautifulSoup, documentation-wise.

Zeynel 2009-11-29 05:55:30

From the BS documentation:

Here are some ways to navigate the soup:

soup.contents[0].name
# u'html'

When I try, I get:

soup.contents[0].name
Traceback (most recent call last):
  File "<pyshell#316>", line 1, in <module>
    soup.contents[0].name
  File "C:\Python26\BeautifulSoup.py", line 427, in __getattr__
    raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__.__name__, attr)
AttributeError: 'NavigableString' object has no attribute 'name'

Zeynel 2009-11-29 06:00:40

Answer 2

+1 A:

It seems that you are using BeautifulSoup 3.1.

I suggest reverting to BeautifulSoup 3.0.7 (because of this problem).

I just tested with 3.0.7 and got the results you expect:

>>> soup.findAll(href=re.compile(r'/cabel'))
[<a href="/cabel">Abel, Christian</a>]

Testing with BeautifulSoup 3.1 gets the results you are seeing. There is probably a malformed tag in the html but I didn't see what it was in a quick look.

Mark Peters 2009-11-29 17:48:20

Answer 3

+1 A:

I just answered this on the Beautiful Soup mailing list as a response to Zeynel's email to the list. Basically, there is an error in the web page that totally kills Beautiful Soup 3.1 during parsing, but is merely mangled by Beautiful Soup 3.0.

The thread is located at the Google Groups archive.

Aaron DeVore 2009-11-29 20:16:04

Answer 4

+3 A:

I know BeautifulSoup is the canonical HTML parsing module, but sometimes you just want to scrape out some substrings from some HTML, and pyparsing has some useful methods to do this. Using this code:

from pyparsing import makeHTMLTags, withAttribute, SkipTo
import urllib

# get the HTML from your URL
url = "http://www.whitecase.com/Attorneys/List.aspx?LastName=&FirstName="
page = urllib.urlopen(url)
html = page.read()
page.close()

# define opening and closing tag expressions for <td> and <a> tags
# (makeHTMLTags also comprehends tag variations, including attributes, 
# upper/lower case, etc.)
tdStart,tdEnd = makeHTMLTags("td")
aStart,aEnd = makeHTMLTags("a")

# only interested in tdStarts if they have "class=altRow" attribute
tdStart.setParseAction(withAttribute(("class","altRow")))

# compose total matching pattern (add trailing tdStart to filter out 
# extraneous <td> matches)
patt = tdStart + aStart("a") + SkipTo(aEnd)("text") + aEnd + tdEnd + tdStart

# scan input HTML source for matching refs, and print out the text and 
# href values
for ref,s,e in patt.scanString(html):
    print ref.text, ref.a.href

I extracted 914 references from your page, from Abel to Zupikova.

Abel, Christian /cabel
Acevedo, Linda Jeannine /jacevedo
Acuña, Jennifer /jacuna
Adeyemi, Ike /igbadegesin
Adler, Avraham /aadler
...
Zhu, Jie /jzhu
Zídek, Aleš /azidek
Ziółek, Agnieszka /aziolek
Zitter, Adam /azitter
Zupikova, Jana /jzupikova
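For comparison, the same kind of substring scraping can be sketched with only the standard library's html.parser (Python 3). This is a different technique from the pyparsing code above, shown as a dependency-free stand-in. The class name AltRowLinks and the sample markup are hypothetical, and the sketch assumes each link sits directly inside a td with class="altRow".

```python
from html.parser import HTMLParser


class AltRowLinks(HTMLParser):
    """Collect (text, href) for <a> tags found inside <td class="altRow">."""

    def __init__(self):
        super().__init__()
        self.in_alt_td = False  # currently inside a <td class="altRow">?
        self.href = None        # href of the <a> being read, if any
        self.text = []          # text fragments of the current <a>
        self.links = []         # collected (text, href) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            self.in_alt_td = attrs.get("class") == "altRow"
        elif tag == "a" and self.in_alt_td and "href" in attrs:
            self.href = attrs["href"]
            self.text = []

    def handle_data(self, data):
        # Only keep text that appears inside a tracked <a>.
        if self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            self.links.append(("".join(self.text).strip(), self.href))
            self.href = None
        elif tag == "td":
            self.in_alt_td = False


# Hypothetical sample markup (not the real page).
SAMPLE = """
<table>
  <tr><td class="altRow"><a href="/cabel">Abel, Christian</a></td><td class="altRow">Partner</td></tr>
  <tr><td><a href="/other">Not in an altRow cell</a></td></tr>
</table>
"""

parser = AltRowLinks()
parser.feed(SAMPLE)
parser.close()

for text, href in parser.links:
    print(text, href)
```

Unlike the pyparsing pattern, this version tracks state by hand, so it takes more code but needs no third-party install.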

Paul McGuire 2009-11-29 21:20:40

I'll definitely try pyparsing. This makes more sense to me than BeautifulSoup.

Zeynel 2009-11-29 22:53:43
