beautifulsoup

How to make Beautiful Soup output HTML entities?

I'm trying to sanitize and XSS-proof some HTML input from the client. I'm using Python 2.6 with Beautiful Soup. I parse the input, strip all tags and attributes not in a whitelist, and transform the tree back into a string. However... >>> unicode(BeautifulSoup('text < text')) u'text < text' That doesn't look like valid HTML to me. An...
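For the escaping concern in this question, one option that does not rely on Beautiful Soup's serializer is to escape text content separately. The following is only a minimal sketch, assuming Python 3 and the standard-library `html.escape` (the question itself uses Python 2.6); the helper name `escape_text` is hypothetical.

```python
import html


def escape_text(text):
    """Escape &, < and > so the text can be placed inside HTML element content."""
    # quote=False leaves " and ' untouched; use quote=True for attribute values.
    return html.escape(text, quote=False)


print(escape_text("text < text"))  # text &lt; text
```

This only covers escaping of text; it is not a substitute for the whitelist filtering described in the question.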

html5lib/lxml examples for BeautifulSoup users?

I'm trying to wean myself from BeautifulSoup, which I love but seems to be (aggressively) unsupported. I'm trying to work with html5lib and lxml, but I can't seem to figure out how to use the "find" and "findall" operators. By looking at the docs for html5lib, I came up with this for a test program: import cStringIO f = cStringIO.S...

Beautifulsoup, Python and HTML automatic page truncating?

Hello, I'm using Python and BeautifulSoup to parse HTML pages. Unfortunately, for some pages (> 400K) BeautifulSoup is truncating the HTML content. I use the following code to get the set of "div"s: findSet = SoupStrainer('div') set = BeautifulSoup(htmlSource, parseOnlyThese=findSet) for it in set: print it At a certain point, th...

What is the Ruby equivalent of the Python BeautifulSoup library?

I'm looking for a forgiving HTML parser for scraping HTML and extracting data in Ruby. I've had success using BeautifulSoup for this. What is the Ruby equivalent? ...

BeautifulSoup or regex HTML table to data structure?

Hi, I've got an HTML table that I'm trying to parse the information from. However, some of the tables span multiple rows/columns, so what I would like to do is use something like BeautifulSoup to parse the table into some type of Python structure. I'm thinking of just using a list of lists so I would turn something like <tr> <td>1,1</...

How do I get a list of all parent tags in BeautifulSoup?

Let's say I have a structure like this: <folder name="folder1"> <folder name="folder2"> <bookmark href="link.html"> </folder> </folder> If I point to bookmark, what would be the command to just extract all of the folder lines? For example, bookmarks = soup.findAll('bookmark') then beautifulsoupcommand(bookmarks[...

Get data from the meta tags using BeautifulSoup

I am trying to read the description from the meta tag, and this is what I used: soup.findAll(name="description"). It does not work; however, the code below works just fine: soup.findAll(align="center"). How do I read the description from the meta tag in the head of a document? ...

Help interpreting code snippet

I am very new to Python and BeautifulSoup. In the for statement, what is incident? Is it a class, type, variable? I am totally lost on the line following the for. Can someone please explain this code to me? for incident in soup('td', width="90%"): where, linebreak, what = incident.contents[:3] print where.strip() print what.str...
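In the snippet quoted in this question, `incident` is an ordinary loop variable: on each pass it is bound to one of the items returned by `soup('td', width="90%")`. The line after it slices the first three children and unpacks them into three names. A minimal sketch of that unpacking, with a plain list standing in for `incident.contents` (the sample values are made up for illustration):

```python
# Stand-in for incident.contents: a list of child nodes (plain strings here).
contents = ["  Gulf of Aden  ", "<br/>", "  Attempted boarding  ", "extra node"]

# Take the first three items and bind each one to its own name.
where, linebreak, what = contents[:3]

print(where.strip())
print(what.strip())
```

If the slice held fewer than three items, the unpacking would raise a ValueError, so the original code relies on every matching cell having at least three children.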

How parsing works

I am trying the sample code for the piracy report. The line of code for incident in soup('td', width="90%"): searches the soup for an element td with the attribute width="90%", correct? It invokes the class BeautifulStoneSoup(Tag, SGMLParser): method def __init__(self, markup="", parseOnlyThese=None, fromEncoding=None, ...

I am writing a scraper that downloads all the image files from multiple pages across the same site and saves them to a specific folder.

The pages have only one variable which changes, and each page only holds one image. (example: http://www.example.com/photos/ooo1.jpg ...http://www.example.com/photos/1745.jpg) I'm currently building the script with Python and BeautifulSoup but am having a problem creating a loop with the changing variable. I am just getting started with ...
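A loop over the changing part of the URL could be sketched as below. This assumes the file names are four-digit, zero-padded numbers (one reading of the example URLs in the question, not something the excerpt confirms); the function name `photo_urls` and its parameters are hypothetical.

```python
def photo_urls(first, last, base="http://www.example.com/photos/"):
    """Yield one image URL per number from first to last, inclusive."""
    for n in range(first, last + 1):
        # {:04d} pads the number with leading zeros to four digits (assumed format).
        yield "{}{:04d}.jpg".format(base, n)


for url in photo_urls(1, 3):
    print(url)
```

Each generated URL could then be passed to whatever download routine the scraper uses.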

parse html beautiful soup

Hi all. I have an HTML page <a email="[email protected]" href="http://www.max.ru/agent?message&amp;[email protected]" title="Click herе" class="mf_spIco spr-mrim-9"></a><a class="mf_t11" type="booster" href="http://max.ru/mail/corporate/"> I need to parse the email string soup = BeautifulSoup(data string = soup.find("a",{"ema...

How do I get the HTML after it's manipulated by BeautifulSoup?

I have a simple script where I am fetching an HTML page, passing it to BeautifulSoup to remove all script and style tags, then I want to pass the HTML result to another method. Is there an easy way to do this? Skimming the BeautifulSoup.py, I haven't seen it yet. soup = BeautifulSoup(html) for script in soup("script"): soup.script.e...

How to: remove part of a Unicode string in Python following a special character

Hi all, first a short summary: Python ver: 3.1 system: Linux (Ubuntu) I am trying to do some data retrieval through Python and BeautifulSoup. Unfortunately some of the tables I am trying to process contain cells where the following text string exists: 789.82 ± 10.28 For this to work I need two things: How do I handle "weird" sym...
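For the string in this question, dropping everything from the special character onward can be done with plain string methods in Python 3, where `str` is already Unicode. A minimal sketch; the helper name `value_before_marker` is hypothetical:

```python
def value_before_marker(cell_text, marker="\u00b1"):
    """Return the text before the first marker (default: the ± sign), stripped."""
    # partition returns (head, separator, tail); head is the whole string
    # when the marker does not occur.
    head, _sep, _tail = cell_text.partition(marker)
    return head.strip()


print(value_before_marker("789.82 ± 10.28"))  # 789.82
```

The result is still a string; wrap it in `float()` if a number is needed.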

Special character use in Python 2.6

I am more than a bit tired, but here goes: I am doing some HTML scraping in Python 2.6.5 with BeautifulSoup on an Ubuntu box. Reason for Python 2.6.5: BeautifulSoup sucks under 3.1. I try to run the following code: # dataretriveal from html files from DETHERM # -*- coding: utf-8 -*- import sys,os,re,csv from BeautifulSoup import Beauti...

Problem with findAll with BeautifulSoup on function

I am new to Python and I have a module that works fine with BeautifulSoup and parses the HTML file. I want to use this module as a function in another file, but although I copied almost exactly the same code into the function, I now get this error: AttributeError: 'NoneType' object has no attribute 'findAll' Here is the code of the module that wo...

replacing html tags with BeautifulSoup

I'm currently reformatting some HTML pages with BeautifulSoup, and I ran into a bit of a problem. My problem is that the original HTML has things like this: <li><p>stff</p></li> and <li><div><p>Stuff</p></div></li> as well as <li><div><p><strong>stff</strong></p></div><li> With BeautifulSoup I hope to eliminate the div and the ...

python beautifulsoup related problem

Hello all. I have a problem extracting some data from HTML source. The following is a snippet of my HTML source code, and I want to extract the string value in each of the following <td class="gamedate">10/12 00:59</b></td> <td class="gametype">오버언더</b></td> <td class="legue"><nobr style="width:100%;overflow:hidden;letter-spacing:-1;font-size...

beautifulsoup python parsing problem

Possible Duplicate: python beautifulsoup related problem Hello all. I have a problem extracting some data from HTML source. The following is a snippet of my HTML source code, and I want to extract the string value in each of the following <td class="gamedate">10/12 00:59</b></td> <td class="gametype">오버언더</b></td> <td class="legue"...

beautiful soup not parsing site

I started this script in Calibre. When I found out that Calibre cannot do what I want, I installed Spyder and I am now in the process of making it real Python. I am trying (at this stage) to get a list of URLs from an index. In the new URLs, I want to get other URLs (sort of an index of indexes). In order to get to the 2nd level index i n...

Beautiful Soup findAll() on the results of a findAll() returns TypeError

Hi, I'm new to both Python and Beautiful Soup. I'm trying to get the text only from a certain part of a table. But it seems the result of a findAll is not a BeautifulSoup type that I can run findAll on again. select = soup.find('table',{'id':"tp_section_1"}) print "got the right table" tissues = select.findAll('td',{"class":re.compile("t...