beautifulsoup

"HTML Agility Pack"-like module for Perl

Hi everyone! Can anyone recommend a good module for Perl like "HTML Agility Pack" (.NET) or "Beautiful Soup"? Thanks in advance! ...

Python: Separating an HTML snippet into paragraphs

I have a snippet of HTML that contains paragraphs. (I mean p tags.) I want to split the string into the different paragraphs. For instance: ''' <p class="my_class">Hello!</p> <p>What's up?</p> <p style="whatever: whatever;">Goodbye!</p> ''' Should become: ['<p class="my_class">Hello!</p>', "<p>What's up?</p>", '<p style="whatever: w...

Adding text to p tag in Beautiful Soup

I was wondering if anyone knew how to add text to a tag (p, b -- any tag where you might want to include character data). The documentation doesn't mention anywhere how you might do this. ...

Extract divs with at least one class in BeautifulSoup

Suppose you have a web page with many of these: <div class="story cid-8797378263432 l-es headline-story thumbnail-true"> where the cid-nnnnnnnnnnnn class can vary. How would you get all the divs with BeautifulSoup? I tried: soup.find('div', {'class': 'story'}) but that didn't work. It seems to look for the divs with ONLY the story class. ...
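
One possible approach, sketched under the assumption that the newer bs4 package (Beautiful Soup 4) is available rather than the BeautifulSoup 3 module the question appears to use. In bs4, the class_ keyword matches a tag when any one of its classes equals the given value. The sample markup and variable names below are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modeled on the div in the question
html = """
<div class="story cid-8797378263432 l-es headline-story thumbnail-true">first</div>
<div class="story cid-1111111111111 l-es">second</div>
<div class="sidebar">not a story</div>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ is compared against each individual class of a multi-valued
# class attribute, so the varying cid-... class does not matter here.
story_divs = soup.find_all("div", class_="story")
story_texts = [div.get_text() for div in story_divs]
print(story_texts)
```

Note that find() returns only the first match; find_all() is used here because the question asks for all the divs.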

Using BeautifulSoup to parse lines separated by <br> tags?

I have a page that looks like this: Company A<br /> 123 Main St.<br /> Suite 101<br /> Someplace, NY 1234<br /> <br /> <br /> <br /> Company B<br /> 456 Main St.<br /> Someplace, NY 1234<br /> <br /> <br /> <br /> Sometimes there are two rather than three "br" tags separating the entries. How would I use BeautifulSoup to parse throug...
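
A minimal sketch of one way to group such lines, assuming the fragment contains only text and br tags. It uses a plain regular expression instead of BeautifulSoup, which is a swapped-in technique and not what the question asks for; split_entries is a hypothetical helper name. Any run of two or more br tags is treated as an entry separator, so it does not matter whether two or three appear.

```python
import re

# Sample fragment copied from the shape shown in the question
html = (
    "Company A<br /> 123 Main St.<br /> Suite 101<br /> Someplace, NY 1234<br />"
    " <br /> <br /> <br /> "
    "Company B<br /> 456 Main St.<br /> Someplace, NY 1234<br /> <br /> <br />"
)

def split_entries(markup):
    # Split on every <br> variant, then strip whitespace around each piece.
    pieces = [p.strip() for p in re.split(r"<br\s*/?>", markup, flags=re.IGNORECASE)]
    entries, current = [], []
    for piece in pieces:
        if piece:
            current.append(piece)
        elif current:
            # An empty piece means consecutive <br> tags: close the entry.
            entries.append(current)
            current = []
    if current:
        entries.append(current)
    return entries

entries = split_entries(html)
for entry in entries:
    print(entry)
```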

How to save the output of a Python script to a text file?

I'm trying to make it so this script from BeautifulSoup import BeautifulSoup import sys, re, urllib2 import codecs html_str = urllib2.urlopen(URL).read() soup = BeautifulSoup(html_str) for row in soup.findAll("tr"): for col in row.findAll(re.compile("td|th")): sys.stdout.write((col.string if col.string else '') + '|')...
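
A hedged sketch of the general idea: write to an opened file object instead of sys.stdout. The rows list is a hypothetical stand-in for the cell text the script collects, and the output path is a temporary file chosen only so the example is self-contained.

```python
import io
import os
import tempfile

# Hypothetical stand-in for the cell text pulled from each <tr>
rows = [["Name", "Price"], ["Item A", "14.8k"], ["Item B", None]]

# Assumed output location; any writable path would do
out_path = os.path.join(tempfile.mkdtemp(), "table.txt")

# io.open with an explicit encoding avoids surprises with non-ASCII cells
with io.open(out_path, "w", encoding="utf-8") as out:
    for row in rows:
        for cell in row:
            # Same "empty string if missing" rule as in the question
            out.write((cell if cell else u"") + u"|")
        out.write(u"\n")

# Read the file back to confirm what was saved
with io.open(out_path, encoding="utf-8") as f:
    saved = f.read()
print(saved)
```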

How do I remove a column from a table in BeautifulSoup (Python)?

I have an HTML table, and I would like to remove a column. What is the easiest way to do this with BeautifulSoup or any other Python library? ...

How do I make the result of this a variable?

Right now it's set up to write to a file, but I want it to output the value to a variable. I'm not sure how. from BeautifulSoup import BeautifulSoup import sys, re, urllib2 import codecs woof1 = urllib2.urlopen('someurl').read() woof_1 = BeautifulSoup(woof1) woof2 = urllib2.urlopen('someurl').read() woof_2 = BeautifulSoup(woof2) GE_DB = o...

py2app Not Finding BeautifulSoup

I have a script that uses BeautifulSoup that I want to make into a standalone app using py2app. When I run the app made by py2app, I get an error saying that the module BeautifulSoup could not be found. My sys.path has '/Library/Python/2.6/site-packages/BeautifulSoup-3.1.0.1-py2.6.egg', so it seems like it should be there. Any advice? ...

How do I stop Beautiful Soup from skipping rows while parsing?

While using BeautifulSoup to parse a table in HTML, every other row starts with <tr class="row_k"> instead of a tr tag without a class. Sample HTML: <tr class="row_k"> <td><img src="some picture url" alt="Item A"></td> <td><a href="some url"> Item A</a></td> <td>14.8k</td> <td><span class="drop">-555</span></td> <td> <img src="so...

Cannot fetch a web site with python urllib.urlopen() or any web browser other than Shiretoko

Hi, here is the URL of the site I want to fetch: https://salami.parc.com/spartag/GetRepository?friend=jmankoff&keywords=antibiotic&option=jmankoff%27s+tags When I fetch the web site and display the contents with the following code: sock = urllib.urlopen("https://salami.parc.com/spartag/GetRepository?fri...

Quicker way than "try" and "except"? - Python

I often have code written as follows: try: self.title = item.title().content.string except AttributeError, e: self.title = None Is there a quicker way of dealing with this? A one-liner? ...
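
One way to shorten this pattern, offered as a sketch only: a small helper that walks a chain of attribute names and gives up with None as soon as one is missing. The name dig is hypothetical, and the SimpleNamespace objects merely imitate the shape of the object returned by item.title() in the question.

```python
from types import SimpleNamespace

def dig(obj, *names):
    """Follow attribute names in order; return None as soon as one is missing."""
    for name in names:
        obj = getattr(obj, name, None)
        if obj is None:
            return None
    return obj

# Hypothetical stand-ins for the object returned by item.title()
full = SimpleNamespace(content=SimpleNamespace(string="Hello"))
empty = SimpleNamespace(content=None)

title = dig(full, "content", "string")     # every attribute exists
missing = dig(empty, "content", "string")  # chain breaks at .content
print(title, missing)
```

With such a helper the assignment becomes a single line, for example self.title = dig(item.title(), "content", "string"). It does not cover the case where the item.title() call itself raises, which the original try/except does.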

I want to find the span tag between the LI tag and its attributes but no luck.

I want to find the span tag between the LI tag and its attributes. I am trying with Beautiful Soup but having no luck. Details of my code are below. Can anyone point me to the right methodology? In this code, my getId function should return id = "0_False-2". Does anyone know the right method? from BeautifulSoup import BeautifulSoup as bs import re html = '<ul>\ <li...

From escaped HTML to regular HTML? - Python

Hi folks, I used BeautifulSoup to handle XML files that I have collected through a REST API. The responses contain HTML code, but BeautifulSoup escapes all the HTML tags so it can be displayed nicely. Unfortunately, I need the HTML code. How would I go about transforming the escaped HTML into proper markup? Help would be very ...
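
A minimal sketch of unescaping entity-encoded markup, assuming Python 3, where the standard library provides html.unescape (the question appears to date from the Python 2 era, so this may not match the asker's setup). The escaped string is an invented example.

```python
import html

# Invented example of markup that has been entity-escaped
escaped = "&lt;p class=&quot;intro&quot;&gt;Fish &amp; chips&lt;/p&gt;"

# html.unescape turns character references back into literal characters
markup = html.unescape(escaped)
print(markup)
```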

Get document DOCTYPE with BeautifulSoup

I've just started tinkering with Scrapy in conjunction with BeautifulSoup, and I'm wondering if I'm missing something very obvious, but I can't seem to figure out how to get the doctype of a returned HTML document from the resulting soup object. Given the following HTML: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org...

Trying to grab just absolute links from a webpage using BeautifulSoup

I am reading the contents of a webpage using BeautifulSoup. What I want is to just grab the <a href> values that start with http://. I know that in BeautifulSoup you can search by attributes. I guess I am just having a syntax issue. I would imagine it would go something like: page = urllib2.urlopen("http://www.linkpages.com") soup = BeautifulSo...
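
A sketch of filtering by attribute value, assuming the bs4 package (Beautiful Soup 4) rather than the older module in the question. Passing a compiled regular expression as the href argument keeps only the tags whose href matches it. The markup is invented so the example runs without fetching a page.

```python
import re
from bs4 import BeautifulSoup

# Invented markup standing in for the fetched page
html = """
<a href="http://www.example.com/page">absolute</a>
<a href="/relative/path">relative</a>
<a href="https://secure.example.com/">https</a>
<a name="anchor-without-href">no href</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Only anchors whose href begins with http:// pass the filter;
# anchors with no href at all are skipped.
links = [a["href"] for a in soup.find_all("a", href=re.compile(r"^http://"))]
print(links)
```

To accept https links as well, the pattern could be widened to r"^https?://".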

Use BeautifulSoup to extract sibling nodes between two nodes

I've got a document like this: <p class="top">I don't want this</p> <p>I want this</p> <table> <!-- ... --> </table> <img ... /> <p> and all that stuff too</p> <p class="end">But not this and nothing after it</p> I want to extract everything between the p[class=top] and p[class=end] paragraphs. Is there a nice way I can do thi...

Generate a table of contents from HTML with Python

I'm trying to generate a table of contents from a block of HTML (not a complete file - just content) based on its <h2> and <h3> tags. My plan so far was to: (1) extract a list of headers using BeautifulSoup; (2) use a regex on the content to place anchor links before/inside the header tags (so the user can click on the table of contents) -- Th...

Getting BeautifulSoup to find a specific <p>

I'm trying to put together a basic HTML scraper for a variety of scientific journal websites, specifically trying to get the abstract or introductory paragraph. The current journal I'm working on is Nature, and the article I've been using as my sample can be seen at http://www.nature.com/nature/journal/v463/n7284/abs/nature08715.html....

Python BeautifulSoup: trying to remove HTML 'span' tags

I am trying to remove [<span class="street-address"> 510 E Airline Way </span>] and I have used this clean function to remove whatever is between < and >: def clean(val): if type(val) is not StringType: val = str(val) val = re.sub(r'<.*?>', '',val) val = re.sub("\s+" , " ", val) return val.strip() and ...
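
For reference, a Python 3 rendering of the clean function from the excerpt, offered as a sketch. The isinstance check replaces the StringType comparison, which is an assumption about the intent; the regular expressions are unchanged apart from using a raw string for the whitespace pattern.

```python
import re

def clean(val):
    """Strip anything that looks like a tag and collapse runs of whitespace."""
    if not isinstance(val, str):
        val = str(val)
    val = re.sub(r"<.*?>", "", val)  # drop <...> spans (non-greedy)
    val = re.sub(r"\s+", " ", val)   # collapse whitespace to single spaces
    return val.strip()

cleaned = clean('[<span class="street-address"> 510 E Airline Way </span>]')
print(cleaned)
```

The square brackets survive because the pattern only removes text between < and >; anything outside the tags is left in place.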