In answer to a previous question, several people suggested that I use BeautifulSoup for my project. I've been struggling with its documentation and I just cannot parse it. Can somebody point me to the section that explains how to translate this expression into a BeautifulSoup expression?
hxs.select('//td[@class="altRow"][2]/a/@href').re('/.a\w+')
The above expression is from Scrapy. I'm trying to apply the regex re('/.a\w+') to the second td with class altRow to get the links from there.
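To make the goal concrete, here is a rough sketch of what I think the BeautifulSoup equivalent might look like. It assumes BeautifulSoup 4 (the bs4 package) with the built-in "html.parser"; the HTML fragment is a hypothetical one modelled on the page source quoted later in this question, and the first cell's content is invented purely for illustration. I am not sure this is the idiomatic way to do it.

```python
import re
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 is installed

# Hypothetical fragment; only the second cell comes from the real page source.
html = """
<table><tr>
<td class="altRow">placeholder cell</td>
<td class="altRow" valign="middle" width="34%"><a href='/cabel'>Abel, Christian</a></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"/.a\w+")

links = []
for row in soup.find_all("tr"):
    # td[@class="altRow"][2]: the second altRow cell among this row's children
    cells = row.find_all("td", class_="altRow", recursive=False)
    if len(cells) >= 2:
        # /a/@href: href of each <a> that is a direct child of that cell
        for a in cells[1].find_all("a", href=True, recursive=False):
            # .re(...): keep only the parts of the href that match the regex
            links.extend(pattern.findall(a["href"]))

print(links)
```

On this small fragment the loop collects "/cabel", which is the result I get from the Scrapy expression on the real page.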
I would also appreciate pointers to any other tutorials or documentation. I couldn't find any.
Thanks for your help.
Edit: I am looking at this page:
>>> soup.head.title
<title>White & Case LLP - Lawyers</title>
>>> soup.find(href=re.compile("/cabel"))
>>> soup.find(href=re.compile("/diversity"))
<a href="/diversity/committee">Committee</a>
Yet if you look at the page source, "/cabel" is there:
<td class="altRow" valign="middle" width="34%">
<a href='/cabel'>Abel, Christian</a>
For some reason, the search results are not visible to BeautifulSoup, but they are visible to XPath, because hxs.select('//td[@class="altRow"][2]/a/@href').re('/.a\w+') catches "/cabel".
Edit: cobbal: It is still not working. But when I search this:
>>> soup.findAll(href=re.compile(r'/.a\w+'))
[<link href="/FCWSite/Include/styles/main.css" rel="stylesheet" type="text/css" />, <link rel="shortcut icon" type="image/ico" href="/FCWSite/Include/main_favicon.ico" />, <a href="/careers/northamerica">North America</a>, <a href="/careers/middleeastafrica">Middle East Africa</a>, <a href="/careers/europe">Europe</a>, <a href="/careers/latinamerica">Latin America</a>, <a href="/careers/asia">Asia</a>, <a href="/diversity/manager">Diversity Director</a>]
>>>
it returns all the links with second character "a" but not the lawyer names. So for some reason those links (such as "/cabel") are not visible to BeautifulSoup. I don't understand why.
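As a sanity check, I also tried the regex on its own, outside BeautifulSoup, against a few href values copied from the output above. This is a minimal sketch using only Python's re module; the list of hrefs is just a hand-picked sample, not the full page.

```python
import re

pattern = re.compile(r"/.a\w+")

# A few href values copied from this question.
hrefs = ["/cabel", "/diversity/committee", "/careers/asia"]

# Keep the hrefs in which the pattern matches anywhere in the string.
matches = [h for h in hrefs if pattern.search(h)]
print(matches)
```

The pattern does match "/cabel" here, which suggests (but does not prove) that the regex is fine and that the lawyer links are missing from the tree BeautifulSoup built from the page.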