views: 584

answers: 4

In answer to a previous question, several people suggested that I use BeautifulSoup for my project. I've been struggling with its documentation and I just cannot parse it. Can somebody point me to the section where I should be able to translate this expression into a BeautifulSoup expression?

hxs.select('//td[@class="altRow"][2]/a/@href').re('/.a\w+')

The above expression is from Scrapy. I'm trying to apply the regex re('/.a\w+') to the td elements with class altRow to get the links from there.
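
For reference, here is what that regex does on its own (plain Python; the href values are taken from the page itself):

>>> import re
>>> re.search(r'/.a\w+', '/cabel').group()
'/cabel'
>>> re.search(r'/.a\w+', '/careers/europe').group()
'/careers'
>>> print re.search(r'/.a\w+', '/diversity')
None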

I would also appreciate pointers to any other tutorials or documentation. I couldn't find any.

Thanks for your help.

Edit: I am looking at this page:

>>> soup.head.title
<title>White & Case LLP - Lawyers</title>
>>> soup.find(href=re.compile("/cabel"))
>>> soup.find(href=re.compile("/diversity"))
<a href="/diversity/committee">Committee</a>

Yet, if you look at the page source, "/cabel" is there:

 <td class="altRow" valign="middle" width="34%"> 
 <a href='/cabel'>Abel, Christian</a>

For some reason, the search-result links are not visible to BeautifulSoup, but they are visible to XPath: hxs.select('//td[@class="altRow"][2]/a/@href').re('/.a\w+') catches "/cabel".

Edit: cobbal: It is still not working. But when I search for this:

>>> soup.findAll(href=re.compile(r'/.a\w+'))
[<link href="/FCWSite/Include/styles/main.css" rel="stylesheet" type="text/css" />, <link rel="shortcut icon" type="image/ico" href="/FCWSite/Include/main_favicon.ico" />, <a href="/careers/northamerica">North America</a>, <a href="/careers/middleeastafrica">Middle East Africa</a>, <a href="/careers/europe">Europe</a>, <a href="/careers/latinamerica">Latin America</a>, <a href="/careers/asia">Asia</a>, <a href="/diversity/manager">Diversity Director</a>]
>>>

it returns all the links whose second character is "a", but not the lawyer names. So for some reason those links (such as "/cabel") are not visible to BeautifulSoup. I don't understand why.

+4  A: 

One option is to use lxml (I'm not familiar with BeautifulSoup, so I can't say how to do it there); it supports XPath by default.
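
Something like this, maybe (an untested sketch; it assumes the page HTML is already in html_string):

import re
from lxml import html

tree = html.fromstring(html_string)
# same XPath as in the Scrapy expression
hrefs = tree.xpath('//td[@class="altRow"][2]/a/@href')
# then apply the regex by hand, roughly what Scrapy's .re() does
matches = [m.group() for m in (re.search(r'/.a\w+', h) for h in hrefs) if m]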

Edit:
try (tested):

soup.findAll('td', 'altRow')[1].findAll('a', href=re.compile(r'/.a\w+'), recursive=False)

I used docs at http://www.crummy.com/software/BeautifulSoup/documentation.html

soup should be a BeautifulSoup object

import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(html_string)
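
Putting the pieces together, roughly (a sketch; assumes Python 2, and uses the attorney-list URL that appears further down this page):

import re
import urllib
import BeautifulSoup

# fetch the page and parse it
html_string = urllib.urlopen("http://www.whitecase.com/Attorneys/List.aspx?LastName=&FirstName=").read()
soup = BeautifulSoup.BeautifulSoup(html_string)
# then run the findAll from above
print soup.findAll('td', 'altRow')[1].findAll('a', href=re.compile(r'/.a\w+'), recursive=False)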
cobbal
I don't look forward to this Windows installation (http://codespeak.net/lxml/installation.html) if I can avoid it. Otherwise lxml looks much nicer than BeautifulSoup (documentation-wise).
Zeynel
From the BS documentation: "Here are some ways to navigate the soup: soup.contents[0].name # u'html'". When I try it, I get:

>>> soup.contents[0].name
Traceback (most recent call last):
  File "<pyshell#316>", line 1, in <module>
    soup.contents[0].name
  File "C:\Python26\BeautifulSoup.py", line 427, in __getattr__
    raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__.__name__, attr)
AttributeError: 'NavigableString' object has no attribute 'name'
Zeynel
+1  A: 

It seems that you are using BeautifulSoup 3.1.

I suggest reverting to BeautifulSoup 3.0.7 (because of this problem).

I just tested with 3.0.7 and got the results you expect:

>>> soup.findAll(href=re.compile(r'/cabel'))
[<a href="/cabel">Abel, Christian</a>]

Testing with BeautifulSoup 3.1 gets the results you are seeing. There is probably a malformed tag in the HTML, but I didn't spot it in a quick look.
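
A quick way to confirm which version you have installed (BS3 exposes a __version__ attribute; the version string shown here is illustrative):

>>> import BeautifulSoup
>>> BeautifulSoup.__version__
'3.1.0.1'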

Mark Peters
+1  A: 

I just answered this on the Beautiful Soup mailing list as a response to Zeynel's email to the list. Basically, there is an error in the web page that totally kills Beautiful Soup 3.1 during parsing, but is merely mangled by Beautiful Soup 3.0.

The thread is located at the Google Groups archive.

Aaron DeVore
+3  A: 

I know BeautifulSoup is the canonical HTML parsing module, but sometimes you just want to scrape out some substrings from some HTML, and pyparsing has some useful methods to do this. Using this code:

from pyparsing import makeHTMLTags, withAttribute, SkipTo
import urllib

# get the HTML from your URL
url = "http://www.whitecase.com/Attorneys/List.aspx?LastName=&amp;FirstName="
page = urllib.urlopen(url)
html = page.read()
page.close()

# define opening and closing tag expressions for <td> and <a> tags
# (makeHTMLTags also comprehends tag variations, including attributes, 
# upper/lower case, etc.)
tdStart,tdEnd = makeHTMLTags("td")
aStart,aEnd = makeHTMLTags("a")

# only interested in tdStarts if they have "class=altRow" attribute
tdStart.setParseAction(withAttribute(("class","altRow")))

# compose total matching pattern (add trailing tdStart to filter out 
# extraneous <td> matches)
patt = tdStart + aStart("a") + SkipTo(aEnd)("text") + aEnd + tdEnd + tdStart

# scan input HTML source for matching refs, and print out the text and 
# href values
for ref,s,e in patt.scanString(html):
    print ref.text, ref.a.href

I extracted 914 references from your page, from Abel to Zupikova.

Abel, Christian /cabel
Acevedo, Linda Jeannine /jacevedo
Acuña, Jennifer /jacuna
Adeyemi, Ike /igbadegesin
Adler, Avraham /aadler
...
Zhu, Jie /jzhu
Zídek, Aleš /azidek
Ziółek, Agnieszka /aziolek
Zitter, Adam /azitter
Zupikova, Jana /jzupikova
Paul McGuire
I'll definitely try pyparsing. This makes more sense to me than BeautifulSoup.
Zeynel