beautifulsoup

python lxml on app engine?

Can I use python lxml on Google App Engine? (Or do I have to use Beautiful Soup?) I have started using Beautiful Soup but it seems slow. I am just starting to play with the idea of "screen scraping" data from other websites to create some sort of "mash-up". ...

Extract data from a website's list, without superfluous tags

Working code: Google dictionary lookup via python and beautiful soup -> simply execute and enter a word. I've quite simply extracted the first definition from a specific list item. However to get plain data, I've had to split my data at the line break, and then strip it to remove the superfluous list tag. My question is, is there a met...
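A minimal sketch of one way to avoid the manual split/strip, with a made-up list item for illustration: joining only the text nodes with findAll(text=True) drops the superfluous tags in a single step.

    from BeautifulSoup import BeautifulSoup

    html = '<ul><li><b>noun</b> a large wild cat</li></ul>'   # hypothetical snippet
    soup = BeautifulSoup(html)
    item = soup.find('li')

    # Joining only the text nodes removes the tags without splitting
    # on line breaks and stripping markup by hand.
    definition = ''.join(item.findAll(text=True)).strip()
    print(definition)    # noun a large wild cat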

Parsing HTML in python - lxml or BeautifulSoup? Which of these is better for what kinds of purposes?

From what I can make out, the two main HTML parsing libraries in Python are lxml and BeautifulSoup. I've chosen BeautifulSoup for a project I'm working on, but I chose it for no particular reason other than finding the syntax a bit easier to learn and understand. But I see a lot of people seem to favour lxml and I've heard that lxml is f...

BeautifulSoup Grab Visible Webpage Text

Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage... For instance, this webpage is my test case http://www.nytimes.com/2009/12/21/us/21storm.html .. And I mainly want to just get the body text (article) and maybe even a few tab names here and there. However after trying this suggestion http://stack...
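One common approach, sketched below with the BS3 API the question already uses: strip out the elements whose text is never rendered, then join the remaining text nodes. The list of tags to drop is an assumption, not a definitive filter.

    from BeautifulSoup import BeautifulSoup, Comment

    def visible_text(html):
        soup = BeautifulSoup(html)
        # Remove elements whose contents are never rendered on screen.
        for tag in soup.findAll(['script', 'style', 'head', 'title']):
            tag.extract()
        # Keep the remaining text nodes, skipping HTML comments and blanks.
        texts = soup.findAll(text=True)
        return ' '.join(t.strip() for t in texts
                        if t.strip() and not isinstance(t, Comment))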

How do you get all the rows from a particular table using BeautifulSoup?

Hi all, I know this is really basic, but I am trying to learn Python so that I can scrape data from the web, and I want to learn how to read an HTML table. I can read it into Open Office and it says that it is Table #11. It seems like BeautifulSoup is the preferred choice, but can anyone...
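A minimal sketch of reading one table's rows, assuming the URL is a placeholder and that OpenOffice's "Table #11" corresponds to index 10 of findAll('table'):

    import urllib2
    from BeautifulSoup import BeautifulSoup

    html = urllib2.urlopen('http://example.com/page-with-tables.html').read()  # hypothetical URL
    soup = BeautifulSoup(html)

    # OpenOffice's "Table #11" should correspond to index 10 here.
    table = soup.findAll('table')[10]
    for row in table.findAll('tr'):
        cells = [''.join(cell.findAll(text=True)).strip()
                 for cell in row.findAll(['td', 'th'])]
        print(cells)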

Cannot prettify scraped html in BeautifulSoup

I have a small script that uses urllib2 to get the contents of a site, finds all the link tags, and appends a small piece of HTML at the top and bottom; then I try to prettify it. It keeps returning TypeError: sequence item 1: expected string, Tag found. I have looked around, can't really find the issue. As always, any help, much apprec...
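That TypeError usually means a Tag object ended up in a plain string operation such as ''.join(). A sketch of the usual fix, with the URL and the appended HTML invented for illustration:

    import urllib2
    from BeautifulSoup import BeautifulSoup

    page_html = urllib2.urlopen('http://example.com/').read()   # hypothetical URL
    soup = BeautifulSoup(page_html)
    links = soup.findAll('a')

    header = '<div>added on top</div>'            # hypothetical extra HTML
    footer = '<div>added on bottom</div>'

    # ''.join() only accepts strings, so render each Tag to markup first (Python 2).
    body = ''.join(unicode(tag) for tag in links)
    print(BeautifulSoup(header + body + footer).prettify())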

Equivalent of Beautiful Soup's renderContents() method in lxml?

Is there an equivalent of Beautiful Soup's tag.renderContents() method in lxml? I've tried using element.text, but that doesn't render child tags, as well as ''.join(etree.tostring(child) for child in element), but that doesn't render child text. The closest I've been able to find is etree.tostring(element), but that renders the opening...
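As far as I know lxml has no single call for this, but the question's two attempts can be combined; a small sketch (the helper name is made up):

    from lxml import etree

    def render_contents(element):
        # element.text covers the leading text; tostring() on each child
        # includes that child's tail text, so nothing between tags is lost.
        parts = [element.text or '']
        parts += [etree.tostring(child, encoding='unicode') for child in element]
        return ''.join(parts)

    root = etree.fromstring('<div>one <b>two</b> three</div>')
    print(render_contents(root))    # one <b>two</b> three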

How can I grab CData out of BeautifulSoup?

I have a website that I'm scraping that has a structure similar to the following. I'd like to be able to grab the info out of the CData block. I'm using BeautifulSoup to pull other info off the page, so if the solution can work with that, it would help keep my learning curve down as I'm a Python novice. Specifically, I want to get at the ...
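Assuming BS3 parses the block into a CData text node, which it normally does, the section can be filtered out by type; a sketch with an invented page structure:

    from BeautifulSoup import BeautifulSoup, CData

    html = '<div id="target"><![CDATA[the text I want]]></div>'   # hypothetical structure
    soup = BeautifulSoup(html)

    # CDATA blocks come back as CData text nodes, so filter text nodes by type.
    cdata = soup.find(text=lambda text: isinstance(text, CData))
    print(cdata)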

BeautifulSoup HTML table parsing

I am trying to parse information (html tables) from this site: http://www.511virginia.org/RoadConditions.aspx?j=All&r=1 Currently I am using BeautifulSoup and the code I have looks like this from mechanize import Browser from BeautifulSoup import BeautifulSoup mech = Browser() url = "http://www.511virginia.org/RoadConditions.aspx...
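A minimal sketch completing the mechanize fetch shown in the excerpt and handing the markup to BeautifulSoup; which table on the page holds the road conditions is an assumption, so adjust the index as needed.

    from mechanize import Browser
    from BeautifulSoup import BeautifulSoup

    mech = Browser()
    url = "http://www.511virginia.org/RoadConditions.aspx?j=All&r=1"
    html = mech.open(url).read()          # mechanize responses expose read()

    soup = BeautifulSoup(html)
    # Assumption: the first table holds the data of interest.
    table = soup.findAll('table')[0]
    rows = table.findAll('tr')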

Submitting queries to, and scraping results from aspx pages using python?

I am trying to get results for a batch of queries to this demographics tools page: http://adlab.microsoft.com/Demographics-Prediction/DPUI.aspx The POST action on the form calls the same page (_self) and is probably posting some event data. I read on another post here at stackoverflow that aspx pages typically need some viewstate and va...
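A rough sketch of the usual pattern: GET the page once, copy every hidden input (which carries __VIEWSTATE and friends) into the POST body, add the visible fields, and POST back to the same URL. The visible field name below is a guess, and real ASP.NET forms sometimes also need __EVENTTARGET/__EVENTARGUMENT filled in.

    import urllib
    import urllib2
    from BeautifulSoup import BeautifulSoup

    url = 'http://adlab.microsoft.com/Demographics-Prediction/DPUI.aspx'

    # GET the form page and copy the hidden ASP.NET state fields.
    soup = BeautifulSoup(urllib2.urlopen(url).read())
    form_data = {}
    for hidden in soup.findAll('input', {'type': 'hidden'}):
        if hidden.get('name'):
            form_data[hidden.get('name')] = hidden.get('value', '')

    form_data['searchBox'] = 'example.com'        # hypothetical query field name

    # POST everything back to the same page and parse the result.
    result = BeautifulSoup(urllib2.urlopen(url, urllib.urlencode(form_data)).read())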

How to fetch some data conditionally with Python and Beautiful Soup

Hello, sorry if you feel like this has been asked before, but I have read the related questions and, being quite new to Python, I could not find how to write this request in a clean manner. For now I have this minimal Python code: from mechanize import Browser from BeautifulSoup import BeautifulSoup import re import urllib2 br = Browser() b...

Search and Replace in HTML with BeautifulSoup

I want to use BeautifulSoup to search and replace </a> with </a><br>. I know how to open with urllib2 and then parse to extract all the <a> tags. What I want to do is search and replace the closing tag with the closing tag plus the break. Any help, much appreciated. EDIT I would assume it would be something similar to: soup.findAll('a...
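A minimal sketch using the newer bs4 API, which has insert_after(); with the BS3 import shown in the question you would instead build a Tag and insert it into the parent's contents by index. The sample markup is hypothetical.

    from bs4 import BeautifulSoup

    html = '<p><a href="/one">one</a> and <a href="/two">two</a></p>'   # hypothetical markup
    soup = BeautifulSoup(html, 'html.parser')
    for a in soup.find_all('a'):
        # Place a <br> immediately after each closing </a>.
        a.insert_after(soup.new_tag('br'))

    print(str(soup))   # <p><a href="/one">one</a><br/> and <a href="/two">two</a><br/></p>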

Is it possible to edit the inline code with BeautifulSoup?

I am aware of the ability to edit text with BeautifulSoup, but is it possible to edit the href links? I would like to be able to take, say, <a href="/foo/bar/"> and use BeautifulSoup to change it to <a href="http://www.foobarinc.com/foo/bar/">. I am not sure how I would use BeautifulSoup to do this. Any help, much appreciated. ...
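Attributes behave like dictionary entries on a Tag, so the href can simply be reassigned; a sketch that prefixes relative links with a base URL (the base is an assumption taken from the question's example):

    from urlparse import urljoin                  # urllib.parse.urljoin on Python 3
    from BeautifulSoup import BeautifulSoup

    base = 'http://www.foobarinc.com'             # hypothetical site root
    html = '<a href="/foo/bar/">bar</a>'          # snippet from the question
    soup = BeautifulSoup(html)

    for a in soup.findAll('a', href=True):
        # Tag attributes support item assignment, so just overwrite href.
        a['href'] = urljoin(base, a['href'])

    print(soup)    # <a href="http://www.foobarinc.com/foo/bar/">bar</a>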

Inline parsing in BeautifulSoup in Python

I am writing an HTML document with BeautifulSoup, and I would like it not to split inline text (such as text within the <p> tag) into multiple lines. The issue is that parsing <p>a<span>b</span>c</p> with prettify gives me the output <p> a <span> b </span> c </p> and now the HTML displays spaces between a, b, and c, which ...
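prettify() always inserts the line breaks; serialising the soup with str() (or renderContents()) keeps the markup exactly as parsed. A small sketch:

    from BeautifulSoup import BeautifulSoup

    soup = BeautifulSoup('<p>a<span>b</span>c</p>')

    print(soup.prettify())    # adds line breaks, which render as spaces
    print(str(soup))          # <p>a<span>b</span>c</p>, no extra whitespace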

Beautiful Soup and extracting a div and its contents by ID

soup.find("tagName", { "id" : "articlebody" }) Why does this NOT return the <div id="articlebody"> ... </div> tags and stuff in between? It returns nothing. And I know for a fact it exists because I'm staring right at it from soup.prettify() soup.find("div", { "id" : "articlebody" }) also does not work. Edit: There is no answer to...
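When the markup is there but find() returns nothing, one common cause is the parser choking on malformed HTML earlier in the page and truncating the tree. A hedged workaround is to try a more forgiving parser via the newer bs4 API (the URL is a placeholder):

    import urllib2
    from bs4 import BeautifulSoup    # newer API; lets you pick the underlying parser

    html = urllib2.urlopen('http://example.com/article.html').read()   # hypothetical URL
    # 'html5lib' (a separate package) copes best with broken markup; 'lxml' also works.
    soup = BeautifulSoup(html, 'html5lib')
    div = soup.find('div', id='articlebody')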

How can I use BeautifulSoup to find all the links in a page pointing to a specific domain?

How can I use BeautifulSoup to find all the links in a page pointing to a specific domain? ...
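A small sketch: collect hrefs and keep the ones whose host ends with the target domain. The domain, the sample page, and the exact matching rule are assumptions.

    from urlparse import urlparse                 # urllib.parse on Python 3
    from BeautifulSoup import BeautifulSoup

    domain = 'example.com'                        # hypothetical target domain
    html = '''<a href="http://example.com/a">in</a>
              <a href="http://other.org/b">out</a>
              <a href="/relative">relative</a>'''  # hypothetical page
    soup = BeautifulSoup(html)

    # Relative links have an empty netloc, so only absolute links to the domain match.
    links = [a['href'] for a in soup.findAll('a', href=True)
             if urlparse(a['href']).netloc.endswith(domain)]
    print(links)    # ['http://example.com/a']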

pubDate RSS parsing weirdness with Beautifulsoup/Python

I'm trying to parse an RSS/Podcast feed using Beautifulsoup and everything is working nicely except I can't seem to parse the 'pubDate' field. data = urllib2.urlopen("http://www.democracynow.org/podcast.xml") dom = BeautifulStoneSoup(data, fromEncoding='utf-8') items = dom.findAll('item'); for item in items: title = item.find('titl...
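The likely catch is that BS3's SGML-based parser lower-cases tag names, so the element has to be looked up as 'pubdate'; a sketch, assuming that is the issue here:

    import urllib2
    from BeautifulSoup import BeautifulStoneSoup

    data = urllib2.urlopen('http://www.democracynow.org/podcast.xml')
    dom = BeautifulStoneSoup(data, fromEncoding='utf-8')

    for item in dom.findAll('item'):
        title = item.find('title').string
        # Tag names are lower-cased during parsing: 'pubdate', not 'pubDate'.
        pub_date = item.find('pubdate').string
        print('%s -- %s' % (title, pub_date))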

selfClosingTags in BeautifulSoup

Using BeautifulSoup to parse my XML: import BeautifulSoup soup = BeautifulSoup.BeautifulStoneSoup( """<alan x="y" /><anne>hello</anne>""" ) # selfClosingTags=['alan']) print soup.prettify() This will output: <alan x="y"> <anne> hello </anne> </alan> i.e., the anne tag is a child of the alan tag. If I pass selfClosingTags=['alan...
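Passing the commented-out argument is enough to stop <anne> from being nested inside <alan>; a sketch of the call with selfClosingTags enabled (output shown roughly):

    import BeautifulSoup

    soup = BeautifulSoup.BeautifulStoneSoup(
        """<alan x="y" /><anne>hello</anne>""",
        selfClosingTags=['alan'])
    print(soup.prettify())
    # <alan x="y" />
    # <anne>
    #  hello
    # </anne>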

Parsing XML with BeautifulSoup and handling missing element

I am using BeautifulSoup to parse XML: xml = """<person> <first_name>Matt</first_name> </person>""" soup = BeautifulStoneSoup(xml) first_name = soup.find('first_name').string last_name = soup.find('last_name').string But I have a problem when there is no last_name, because it chokes. Sometimes the feed has it, and sometimes it doesn...
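find() returns None for a missing tag, so a tiny helper that checks for None before touching .string avoids the error; a sketch (the helper name is made up):

    from BeautifulSoup import BeautifulStoneSoup

    def find_text(soup, name, default=None):
        # Guard against find() returning None when the tag is absent.
        el = soup.find(name)
        return el.string if el is not None else default

    xml = """<person>
      <first_name>Matt</first_name>
    </person>"""
    soup = BeautifulStoneSoup(xml)

    first_name = find_text(soup, 'first_name')   # u'Matt'
    last_name = find_text(soup, 'last_name')     # None instead of an exception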

parsing table with BeautifulSoup and write in text file

I need data from a table written to a text file (output.txt) in this format: data1;data2;data3;data4;..... Celkova podlahova plocha bytu;33m;Vytah;Ano;Nadzemne podlazie;Prizemne podlazie;.....;Forma vlastnictva;Osobne All in "one line", separator is ";" (to be exported to a CSV file later). I'm a beginner. Help, thanks. from BeautifulSoup import BeautifulS...
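A rough sketch of the whole pipeline, with the URL as a placeholder and assuming the first table on the page is the one wanted: pull each cell's text, join with ';', and write a single line to output.txt.

    import urllib2
    from BeautifulSoup import BeautifulSoup

    html = urllib2.urlopen('http://example.com/listing.html').read()   # hypothetical URL
    soup = BeautifulSoup(html)
    table = soup.find('table')                    # assumption: the first table is the target

    cells = []
    for row in table.findAll('tr'):
        for td in row.findAll('td'):
            cells.append(''.join(td.findAll(text=True)).strip())

    # One line, ';'-separated, ready for a later CSV export (encode for Python 2 file I/O).
    with open('output.txt', 'w') as f:
        f.write(';'.join(cells).encode('utf-8'))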