"html agility pack" like module for perl
Hi everyone! Can anyone recommend a good Perl module like "html agility pack" (.NET) or "Beautiful Soup"? Thanks in advance! ...
I have a snippet of HTML that contains paragraphs (I mean p tags). I want to split the string into the different paragraphs. For instance: ''' <p class="my_class">Hello!</p> <p>What's up?</p> <p style="whatever: whatever;">Goodbye!</p> ''' should become: ['<p class="my_class">Hello!</p>', "<p>What's up?</p>", '<p style="whatever: w...
I was wondering if anyone knew how to add text to a tag (p, b, or any tag where you might want to include character data). The documentation doesn't mention anywhere how you might do this. ...
Suppose you have a web page with a lot of this: <div class="story cid-8797378263432 l-es headline-story thumbnail-true"> where the cid-nnnnnnnnnnnn class can vary. How would you get all the divs with BeautifulSoup? I tried: soup.find('div', {'class': 'story'}) but that didn't work. It seems to look for the divs with ONLY the story class. ...
I have a page that looks like this: Company A<br /> 123 Main St.<br /> Suite 101<br /> Someplace, NY 1234<br /> <br /> <br /> <br /> Company B<br /> 456 Main St.<br /> Someplace, NY 1234<br /> <br /> <br /> <br /> Sometimes there are two rather than three "br" tags separating the entries. How would I use BeautifulSoup to parse throug...
I'm trying to make it so this script from BeautifulSoup import BeautifulSoup import sys, re, urllib2 import codecs html_str = urllib2.urlopen(URL).read() soup = BeautifulSoup(html_str) for row in soup.findAll("tr"): for col in row.findAll(re.compile("td|th")): for sys.stdout.write((col.string if col.string else '') + '|')...
I have an HTML table, and I would like to remove a column. What is the easiest way to do this with BeautifulSoup or any other Python library? ...
Right now it's set up to write to a file, but I want it to output the value to a variable. Not sure how. from BeautifulSoup import BeautifulSoup import sys, re, urllib2 import codecs woof1 = urllib2.urlopen('someurl').read() woof_1 = BeautifulSoup(woof1) woof2 = urllib2.urlopen('someurl').read() woof_2 = BeautifulSoup(woof2) GE_DB = o...
I have a script that uses BeautifulSoup that I want to make into a standalone app using py2app. When I run the app made by py2app I get an error saying that the module BeautifulSoup could not be found. My sys.path has '/Library/Python/2.6/site-packages/BeautifulSoup-3.1.0.1-py2.6.egg' so it seems like it should be there. Any advice? ...
While using BeautifulSoup to parse an HTML table, every other row starts with <tr class="row_k"> instead of a tr tag without a class. Sample HTML: <tr class="row_k"> <td><img src="some picture url" alt="Item A"></td> <td><a href="some url"> Item A</a></td> <td>14.8k</td> <td><span class="drop">-555</span></td> <td> <img src="so...
Hi, here is the URL of the site I want to fetch: https://salami.parc.com/spartag/GetRepository?friend=jmankoff&keywords=antibiotic&option=jmankoff%27s+tags When I fetch the web site and display the contents with the following code: sock = urllib.urlopen("https://salami.parc.com/spartag/GetRepository?fri...
I often have code written as follows: try: self.title = item.title().content.string except AttributeError, e: self.title = None Is there a quicker way of dealing with this? A one-liner? ...
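For the try/except question above, one possible approach is to wrap the attribute chain in a small helper so each call site becomes a single line. This is only a sketch in Python 3 syntax (the question uses Python 2): `attr_or_none` and `FakeItem` are hypothetical names made up for illustration and are not part of BeautifulSoup.

```python
def attr_or_none(getter, default=None):
    """Call getter(); return default if any step raises AttributeError."""
    try:
        return getter()
    except AttributeError:
        return default


class FakeItem:
    """Hypothetical stand-in for the parsed item in the question."""

    def title(self):
        # Simulates a lookup that found nothing, so .content will fail.
        return None


item = FakeItem()

# The whole chain is deferred inside the lambda, so a missing link
# anywhere along it falls back to the default instead of raising.
title = attr_or_none(lambda: item.title().content.string)
print(title)
```

The call site is now one line, at the cost of hiding every AttributeError raised inside the lambda, including ones caused by genuine bugs.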
I want to find the span tag between the li tag and its attributes. I'm trying with BeautifulSoup but having no luck. Details of my code are below. Can anyone point me to the right methodology? In this code, my getId function should return id = "0_False-2". Does anyone know the right method? from BeautifulSoup import BeautifulSoup as bs import re html = '<ul>\ <li...
Hi folks, I used BeautifulSoup to handle XML files that I have collected through a REST API. The responses contain HTML code, but BeautifulSoup escapes all the HTML tags so it can be displayed nicely. Unfortunately I need the HTML code. How would I go about transforming the escaped HTML into proper markup? Help would be very ...
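For the escaped-markup question above, one option that does not depend on BeautifulSoup at all is the standard library's `html.unescape`, available in Python 3.4 and later (the question predates it and likely used Python 2). A minimal sketch, assuming the escaped text has already been pulled out of the response as a plain string; the sample string is made up:

```python
import html

# Made-up sample of escaped markup, standing in for text taken from the XML.
escaped = "&lt;b&gt;bold&lt;/b&gt; &amp; more"

# html.unescape converts character references such as &lt; and &amp;
# back into the characters they stand for.
unescaped = html.unescape(escaped)
print(unescaped)
```

The result is ordinary markup that can be handed to an HTML parser in a second pass.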
I've just started tinkering with scrapy in conjunction with BeautifulSoup and I'm wondering if I'm missing something very obvious, but I can't seem to figure out how to get the doctype of a returned HTML document from the resulting soup object. Given the following html: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org...
I am reading the contents of a webpage using BeautifulSoup. What I want is to grab just the <a href> values that start with http://. I know that in BeautifulSoup you can search by attributes. I guess I am just having a syntax issue. I would imagine it would go something like this: page = urllib2.urlopen("http://www.linkpages.com") soup = BeautifulSo...
I've got a document like this: <p class="top">I don't want this</p> <p>I want this</p> <table> <!-- ... --> </table> <img ... /> <p> and all that stuff too</p> <p class="end">But not this and nothing after it</p> I want to extract everything between the p[class=top] and p[class=end] paragraphs. Is there a nice way I can do thi...
I'm trying to generate a table of contents from a block of HTML (not a complete file, just content) based on its <h2> and <h3> tags. My plan so far was to: (1) extract a list of headers using BeautifulSoup; (2) use a regex on the content to place anchor links before/inside the header tags (so the user can click on the table of contents) -- Th...
I'm trying to put together a basic HTML scraper for a variety of scientific journal websites, specifically trying to get the abstract or introductory paragraph. The current journal I'm working on is Nature, and the article I've been using as my sample can be seen at http://www.nature.com/nature/journal/v463/n7284/abs/nature08715.html....
I am trying to remove [<span class="street-address"> 510 E Airline Way </span>] and I have used this clean function to remove the one that is in between < > def clean(val): if type(val) is not StringType: val = str(val) val = re.sub(r'<.*?>', '',val) val = re.sub("\s+" , " ", val) return val.strip() and ...