Can I use python lxml on google app engine? ( or do i have to use Beautiful Soup? )
I have started using Beautiful Soup but it seems slow. I am just starting to play with the idea of "screen scraping" data from other websites to create some sort of "mash-up".
...
Working code: Google dictionary lookup via python and beautiful soup -> simply execute and enter a word.
I've quite simply extracted the first definition from a specific list item. However to get plain data, I've had to split my data at the line break, and then strip it to remove the superfluous list tag.
My question is, is there a met...
From what I can make out, the two main HTML parsing libraries in Python are lxml and BeautifulSoup. I've chosen BeautifulSoup for a project I'm working on, but I chose it for no particular reason other than finding the syntax a bit easier to learn and understand. But I see a lot of people seem to favour lxml and I've heard that lxml is f...
Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage... For instance, this webpage is my test case http://www.nytimes.com/2009/12/21/us/21storm.html .. And I mainly want to just get the body text (article) and and maybe even a few tab names here and there. However after trying this suggestion http://stack...
Hi all,
I know this is really basic, but I am really trying to learn Python and want to learn it to scrape data from the web. It may be basic, but I really am trying to learn how to read a HTML table. I can read it into Open Office and it says that it is Table #11.
It seems like BeautifulSoup is the preferred choice, but can anyone...
I have a small script that use urllib2 to get the contents of a site, find all the link tags, appends a small piece of HTML in on the top and bottom, and then I try to prettify it. It keeps returning TypeError: sequence item 1: expected string, Tag found. I have looked around, can't really find the issue. As always, any help, much apprec...
Is there an equivalent of Beautiful Soup's tag.renderContents() method in lxml?
I've tried using element.text, but that doesn't render child tags, as well as ''.join(etree.tostring(child) for child in element), but that doesn't render child text. The closest I've been able to find is etree.tostring(element), but that renders the opening...
I have a website that I'm scraping that has a similar structure the following. I'd like to be able to grab the info out of the CData block.
I'm using BeautifulSoup to pull other info off the page, so if the solution can work with that, it would help keep my learning curve down as I'm a python novice.
Specifically, I want to get at the ...
I am trying to parse information (html tables) from this site: http://www.511virginia.org/RoadConditions.aspx?j=All&r=1
Currently I am using BeautifulSoup and the code I have looks like this
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
mech = Browser()
url = "http://www.511virginia.org/RoadConditions.aspx...
I am trying to get results for a batch of queries to this demographics tools page: http://adlab.microsoft.com/Demographics-Prediction/DPUI.aspx
The POST action on the form calls the same page (_self) and is probably posting some event data. I read on another post here at stackoverflow that aspx pages typically need some viewstate and va...
Hello,
Sorry if you feel like this has been asked but I have read the related questions and being quite new to Python I could not find how to write this request in a clean manner.
For now I have this minimal Python code:
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import re
import urllib2
br = Browser()
b...
I want to use BeautfulSoup to search and replace <\a> with <\a><br>. I know how to open with urllib2 and then parse to extract all the <a> tags. What I want to do is search and replace the closing tag with the closing tag plus the break. Any help, much appreciated.
EDIT
I would assume it would be something similar to:
soup.findAll('a...
I am aware of the ability to edit text with beautifulsoup, is it possible to edit the href links? I would like to be able to take say <a href="/foo/bar/"> and use beautifulsoup to change it to <a href="http://www.foobarinc.com/foo/bar/">. I am not sure how I would use beautifulsoup to do this? Any help, much appreciated.
...
I am writing an HTML document with BeautifulSoup, and I would like it to not split inline text (such as text within the <p> tag) into multiple lines. The issue that I get is that parsing the <p>a<span>b</span>c</p> with prettify gives me the output
<p>
a
<span>
b
</span>
c
</p>
and now the HTML displays spaces between a,b,c, which ...
soup.find("tagName", { "id" : "articlebody" })
Why does this NOT return the <div id="articlebody"> ... </div> tags and stuff in between? It returns nothing. And I know for a fact it exists because I'm staring right at it from
soup.prettify()
soup.find("div", { "id" : "articlebody" }) also does not work.
Edit: There is no answer to...
How can I use BeautifulSoup to find all the links in a page pointing to a specific domain?
...
I'm trying to parse an RSS/Podcast feed using Beautifulsoup and everything is working nicely except I can't seem to parse the 'pubDate' field.
data = urllib2.urlopen("http://www.democracynow.org/podcast.xml")
dom = BeautifulStoneSoup(data, fromEncoding='utf-8')
items = dom.findAll('item');
for item in items:
title = item.find('titl...
Using BeautifulSoup to parse my XML
import BeautifulSoup
soup = BeautifulSoup.BeautifulStoneSoup( """<alan x="y" /><anne>hello</anne>""" ) # selfClosingTags=['alan'])
print soup.prettify()
This will output:
<alan x="y">
<anne>
hello
</anne>
</alan>
ie, the anne tag is a child of the alan tag.
If I pass selfClosingTags=['alan...
I am using BeautifulSoup to parse XML:
xml = """<person>
<first_name>Matt</first_name>
</person>"""
soup = BeautifulStoneSoup(xml)
first_name = soup.find('first_name').string
last_name = soup.find('last_name').string
But I have a problem when there is no last_name, because it chokes. Sometimes the feed has it, and sometimes it doesn...
I need data from table in text file (output.txt) in this format:
data1;data2;data3;data4;.....
Celkova podlahova plocha bytu;33m;Vytah;Ano;Nadzemne podlazie;Prizemne podlazie;.....;Forma vlastnictva;Osobne
All in "one line", separator is ";" (later export in csv-file).
I´m beginner.. Help, thanks.
from BeautifulSoup import BeautifulS...