questions about beautifulsoup | ansaurus

beautifulsoup

Can I change BeautifulSoup's behavior regarding converting XML tags to lowercase?

I'm working on code to parse a configuration file written in XML, where the XML tags are mixed case and the case is significant. Beautiful Soup appears to convert XML tags to lowercase by default, and I would like to change this behavior. I'm not the first to ask a question on this subject [see here]. However, I did not understand the...

Scraping Multiple html files to CSV

I am trying to scrape rows off of over 1200 .htm files that are on my hard drive. On my computer they are here 'file:///home/phi/Data/NHL/pl07-08/PL020001.HTM'. These .htm files are sequential from *20001.htm until *21230.htm. My plan is to eventually toss my data in MySQL or SQLite via a spreadsheet app or just straight in if I can get ...

screen-scraping

Is there anything like hpricot or beautiful soup for php?

Possible Duplicate: Robust, Mature HTML Parser for PHP I am looking for a good way to parse and modify html documents server side in php. Beautiful soup and hpricot look like very good tools but they are not available for php. Are there any good libraries that can do this in php? Tidy appears to be partially what I am looking fo...

Why am I getting "'ResultSet' has no attribute 'findAll'" using BeautifulSoup in Python?

So I am learning Python slowly, and am trying to make a simple function that will draw data from the high scores page of an online game. This is someone else's code that I rewrote into one function (which might be the problem), but I am getting this error. Here is the code: >>> from urllib2 import urlopen >>> from BeautifulSoup import B...

How to get a nested element in beautiful soup

I am struggling with the syntax required to grab some hrefs in a td. The table, tr and td elements don't have any classes or ids. If I wanted to grab the anchor in this example, what would I need? < tr > < td > < a >... Thanks ...

retrieve links from web page using python and beautiful soup

How can I retrieve the links of a webpage and copy the URL address of the links using Python? ...
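One possible way to approach the link question above is sketched below. It swaps in Python 3's standard-library `html.parser` instead of Beautiful Soup so that it runs without extra installs. The class name `LinkCollector`, the helper `extract_links`, and the sample markup are all hypothetical, and a real page would first have to be fetched and decoded to text.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return all href values found in anchor tags, in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


# Hypothetical sample markup, used here instead of a live page.
sample = '<p><a href="http://example.com/a">A</a> <a name="x">no href</a> <a href="/b">B</a></p>'
links = extract_links(sample)
print(links)
```

Anchors without an `href` attribute are skipped, and relative links such as `/b` are returned as written rather than resolved against a base URL.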

Parsing HTML rows into CSV

First off the html row looks like this: <tr class="evenColor"> blahblah TheTextIneed blahblah and ends with </tr> I would show the real html but I am sorry to say don't know how to block it. feels shame Using BeautifulSoup (Python) or any other recommended Screen Scraping/Parsing method I would like to output about 1200 .htm files i...

screen-scraping

how to fix or make an exception for this error

I'm writing code that gets image URLs from any web page; the code is in Python and uses BeautifulSoup and httplib2. When I run the code, I get the following error: Look me http://movies.nytimes.com (this line is printed by the code) Traceback (most recent call last): File "main.py", line 103, in <module> visit(initialList,profu...

Why is BeautifulSoup throwing this HTMLParseError?

I thought BeautifulSoup would be able to handle malformed documents, but when I sent it the source of a page, the following traceback got printed: Traceback (most recent call last): File "mx.py", line 7, in s = BeautifulSoup(content) File "build\bdist.win32\egg\BeautifulSoup.py", line 1499, in __init__ File "build\bdist.win32...

Simple python / Beautiful Soup type question

I'm trying to do some simple string manipulation with the href attribute of a hyperlink extracted using Beautiful Soup: from BeautifulSoup import BeautifulSoup soup = BeautifulSoup('<a href="http://www.some-site.com/">Some Hyperlink</a>') href = soup.find("a")["href"] print href print href[href.indexOf('/'):] All I get is: Traceba...
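The traceback in the excerpt above is cut off, so the exact error is not shown. What is certain is that Python strings have no `indexOf` method (the name comes from Java and JavaScript); the closest built-ins are `str.find` and `str.index`. A minimal Python 3 sketch, with the URL typed in as a plain string instead of being pulled from a parsed tag:

```python
href = "http://www.some-site.com/"

# str.find returns the index of the first match, or -1 if there is none.
# (str.index behaves the same but raises ValueError when nothing matches.)
first_slash = href.find("/")

# Slice from the first slash onward; fall back to an empty string if absent.
tail = href[first_slash:] if first_slash != -1 else ""
print(tail)
```

Checking for `-1` matters because a slice starting at `-1` would silently return the last character instead of failing.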

Decoding HTML entities with Python

I'm trying to decode HTML entities from NYTimes.com and I cannot figure out what I am doing wrong. Take for example: "U.S. Adviser’s Blunt Memo on Iraq: Time ‘to Go Home’" I've tried BeautifulSoup, decode('iso-8859-1'), and django.utils.encoding's smart_str without any success. ...

character-encoding
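For the entity-decoding question above, a minimal sketch using Python 3's standard-library `html.unescape` is shown below. The entity-encoded input is an assumption: the excerpt shows only the rendered headline, so the numeric character references used here are a guess at what the page source might contain.

```python
import html

# Hypothetical entity-encoded form of the headline quoted above.
raw = "U.S. Adviser&#8217;s Blunt Memo on Iraq: Time &#8216;to Go Home&#8217;"

# html.unescape converts named and numeric character references to Unicode.
decoded = html.unescape(raw)
print(decoded)
```

The result is a Unicode string containing curly quotes, so it still has to be encoded (for example as UTF-8) before being written to a byte stream.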

Beautiful Soup cannot find a CSS class if the object has other classes, too

If a page has <div class="class1"> and <p class="class1">, then soup.findAll(True, 'class1') will find them both. If it has <p class="class1 class2">, though, it will not be found. How do I find all objects with a certain class, regardless of whether they have other classes, too? ...

screen-scraping
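The multiple-class question above comes down to treating the `class` attribute as a whitespace-separated list of tokens and not as one string. The sketch below illustrates that idea with Python 3's standard-library `html.parser` in place of Beautiful Soup; `ClassFinder` and the sample markup are hypothetical names introduced for illustration only.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Record each element whose class list contains the wanted token."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # A missing or valueless class attribute is treated as empty.
        class_value = dict(attrs).get("class") or ""
        # Compare against whole tokens, so "class10" does not match "class1".
        if self.wanted in class_value.split():
            self.matches.append((tag, class_value))


# Hypothetical sample markup based on the tags named in the question.
sample = '<div class="class1"></div><p class="class1 class2"></p><p class="class10"></p>'
finder = ClassFinder("class1")
finder.feed(sample)
finder.close()
print(finder.matches)
```

Splitting on whitespace and testing membership avoids both failure modes: exact string comparison misses `class1 class2`, while a plain substring test would wrongly accept `class10`.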

Web scraping sites that require javascript support

Possible Duplicate: Screen Scraping from a web page with a lot of Javascript I just want to do tasks such as form entry and web scraping, but on sites that require javascript support. And I also need to enter forms, scrape, and so on in the same session. Ideally, I'd like a way to control a web browser from the command line. And...

screen-scraping

Mechanize and BeautifulSoup for PHP?

I was wondering if there is anything like Mechanize or BeautifulSoup for PHP? ...

urlopen, BeautifulSoup and UTF-8 Issue

I am just trying to retrieve a web page, but somehow a foreign character is embedded in the HTML file. This character is not visible when I use "View Source." isbn = 9780141187983 url = "http://search.barnesandnoble.com/booksearch/isbninquiry.asp?ean=%s" % isbn opener = urllib2.build_opener() url_opener = opener.open(url) page = url_ope...

lxml equivalent to BeautifulSoup "OR" syntax?

I'm converting some HTML parsing code from BeautifulSoup to lxml. I'm trying to figure out the lxml equivalent syntax for the following BeautifulSoup statement: soup.find('a', {'class': ['current zzt', 'zzt']}) Basically I want to find all of the "a" tags in the document that have a class attribute of either "current zzt" or "zzt". ...

Python web scraping involving HTML tags with attributes

I'm trying to make a web scraper that will parse a web-page of publications and extract the authors. The skeletal structure of the web-page is the following: <html> <body> <div id="container"> <div id="contents"> <table> <tbody> <tr> <td class="author">####I want whatever is located here ###</td> </tr> </tbody> </table> </div> </div> </...

screen-scraping

Parsing out data using BeautifulSoup in Python

I am attempting to use BeautifulSoup to parse through a DOM tree and extract the names of authors. Below is a snippet of HTML to show the structure of the code I'm going to scrape. <html> <body> <div class="list-authors"> <span class="descriptor">Authors:</span> <a href="/find/astro-ph/1/au:+Lin_D/0/1/0/all/0/1">Dacheng Lin</a>, <a h...

What pure Python library should I use to scrape a website?

I currently have some Ruby code used to scrape some websites. I was using Ruby because at the time I was using Ruby on Rails for a site, and it just made sense. Now I'm trying to port this over to Google App Engine, and keep getting stuck. I've ported Python Mechanize to work with Google App Engine, but it doesn't support DOM inspecti...

google-app-engine

Why is BeautifulSoup modifying my self-closing elements?

This is the script I have: import BeautifulSoup if __name__ == "__main__": data = """ <root> <obj id="3"/> <obj id="5"/> <obj id="3"/> </root> """ soup = BeautifulSoup.BeautifulStoneSoup(data) print soup When run, this prints: <root> <obj id="3"></obj> <obj id="5"></obj> <obj id=...
