I'm working on code to parse a configuration file written in XML, where the XML tags are mixed case and the case is significant. Beautiful Soup appears to convert XML tags to lowercase by default, and I would like to change this behavior.
I'm not the first to ask a question on this subject [see here]. However, I did not understand the...
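For the mixed-case tag question above, here is a minimal sketch of one possible workaround. It assumes the standard library's `xml.etree.ElementTree` is an acceptable substitute for Beautiful Soup (it is a different parser, not a Beautiful Soup setting), and the element names `Config`, `ServerName` and `serverName` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical config snippet; the tag names are invented for this sketch.
config_text = (
    "<Config>"
    "<ServerName>alpha</ServerName>"
    "<serverName>beta</serverName>"
    "</Config>"
)

root = ET.fromstring(config_text)

# ElementTree keeps tag names exactly as written, so the two children
# below remain distinct elements.
tags = [child.tag for child in root]
first_value = root.find("ServerName").text
second_value = root.find("serverName").text

print(tags)
print(first_value, second_value)
```

Because XML is case-sensitive by specification, a real XML parser should treat `ServerName` and `serverName` as different tags without any extra configuration.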
I am trying to scrape rows off of over 1200 .htm files that are on my hard drive. On my computer they are here 'file:///home/phi/Data/NHL/pl07-08/PL020001.HTM'. These .htm files are sequential from *20001.htm until *21230.htm. My plan is to eventually toss my data in MySQL or SQLite via a spreadsheet app or just straight in if I can get ...
Possible Duplicate:
Robust, Mature HTML Parser for PHP
I am looking for a good way to parse and modify HTML documents server side in PHP. Beautiful Soup and Hpricot look like very good tools but they are not available for PHP. Are there any good libraries that can do this in PHP? Tidy appears to be partially what I am looking fo...
So I am learning Python slowly, and am trying to make a simple function that will draw data from the high scores page of an online game. This is someone else's code that I rewrote into one function (which might be the problem), but I am getting this error. Here is the code:
>>> from urllib2 import urlopen
>>> from BeautifulSoup import B...
I am struggling with the syntax required to grab some hrefs in a td.
The table, tr and td elements don't have any classes or ids.
If I wanted to grab the anchor in this example, what would I need?
< tr >
< td > < a >...
Thanks
...
How can I retrieve the links of a webpage and copy the URL address of the links using Python?
...
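A minimal sketch of link extraction, assuming Python 3 and only the standard library's `html.parser` (swapped in here instead of Beautiful Soup so the example has no dependencies). The names `LinkCollector` and `extract_links`, the sample markup and the `example.com` base URL are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: records the href of every <a> tag it sees."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html_text, base_url=""):
    """Return the list of link URLs found in html_text."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    return collector.links


# Made-up sample markup for illustration.
sample = '<p><a href="/about">About</a> <a href="http://example.com/x">X</a></p>'
links = extract_links(sample, base_url="http://example.com/")
print(links)
```

For a live page, the HTML text would first have to be fetched (for example with `urllib.request.urlopen`) and decoded to a string before being passed to `extract_links`.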
First off the html row looks like this:
<tr class="evenColor"> blahblah TheTextIneed blahblah and ends with </tr>
I would show the real html, but I am sorry to say I don't know how to block it. (Feels shame.)
Using BeautifulSoup (Python) or any other recommended Screen Scraping/Parsing method I would like to output about 1200 .htm files i...
I'm writing code that gets image URLs from any web page; the code is in Python and uses BeautifulSoup and httplib2.
When I run the code, I get the following error:
Look me http://movies.nytimes.com (this line is printed by the code)
Traceback (most recent call last):
File "main.py", line 103, in <module>
visit(initialList,profu...
I thought BeautifulSoup would be able to handle malformed documents, but when I passed it the source of a page, the following traceback was printed:
Traceback (most recent call last):
File "mx.py", line 7, in
s = BeautifulSoup(content)
File "build\bdist.win32\egg\BeautifulSoup.py", line 1499, in __init__
File "build\bdist.win32...
I'm trying to do some simple string manipulation with the href attribute of a hyperlink extracted using Beautiful Soup:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup('<a href="http://www.some-site.com/">Some Hyperlink</a>')
href = soup.find("a")["href"]
print href
print href[href.indexOf('/'):]
All I get is:
Traceba...
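A minimal sketch of the slicing step, assuming the value is an ordinary Python string. Python strings have no `indexOf` method; the closest built-ins are `str.find`, which returns -1 when the substring is absent, and `str.index`, which raises `ValueError` instead. The variable names `slash_at` and `tail` are made up for this sketch, and the code is written for Python 3.

```python
# Same URL as in the snippet above, used here as a plain string.
href = "http://www.some-site.com/"

# str.find returns the position of the first match, or -1 if there is none.
slash_at = href.find("/")

# Guard against -1 so a missing slash yields an empty string rather than
# a slice that starts at the last character.
tail = href[slash_at:] if slash_at != -1 else ""

print(slash_at)
print(tail)
```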
I'm trying to decode HTML entities from NYTimes.com and I cannot figure out what I am doing wrong.
Take for example:
"U.S. Adviser’s Blunt Memo on Iraq: Time ‘to Go Home’"
I've tried BeautifulSoup, decode('iso-8859-1'), and django.utils.encoding's smart_str without any success.
...
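A minimal sketch of entity decoding, assuming Python 3 and the standard library's `html.unescape`. The `raw` string is an assumption: the page's actual markup is not shown here, so numeric character references are used as a plausible stand-in for the curly quotes in the headline.

```python
import html

# Assumed raw markup; the real page source may encode the quotes differently.
raw = "U.S. Adviser&#8217;s Blunt Memo on Iraq: Time &#8216;to Go Home&#8217;"

# html.unescape converts named and numeric character references to text.
decoded = html.unescape(raw)

print(decoded)
```

If the text comes from the network as bytes, it has to be decoded to a string with the page's real encoding first; unescaping entities does not fix a wrong byte decoding.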
If a page has <div class="class1"> and <p class="class1">, then soup.findAll(True, 'class1') will find them both.
If it has <p class="class1 class2">, though, it will not be found. How do I find all objects with a certain class, regardless of whether they have other classes, too?
...
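A minimal sketch of the matching rule being asked about: split the `class` attribute on whitespace and test for membership, so that `class1 class2` matches but `class10` does not. It uses the standard library's `html.parser` rather than Beautiful Soup, and `ClassFinder`, `find_by_class` and the sample markup are hypothetical names for illustration.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Hypothetical helper: records tags whose class list contains `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.matches = []

    def handle_starttag(self, tag, attrs):
        class_value = dict(attrs).get("class") or ""
        # Compare against whole class tokens, not substrings.
        if self.wanted in class_value.split():
            self.matches.append((tag, class_value))


def find_by_class(html_text, wanted):
    """Return (tag, class attribute) pairs for every matching start tag."""
    finder = ClassFinder(wanted)
    finder.feed(html_text)
    return finder.matches


# Made-up sample markup for illustration.
sample = (
    '<div class="class1"></div>'
    '<p class="class1 class2"></p>'
    '<p class="class10"></p>'
)
matches = find_by_class(sample, "class1")
print(matches)
```

The same token-based test can be expressed in other parsers; the point of the sketch is only that an exact string comparison on the whole attribute misses multi-class elements.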
Possible Duplicate:
Screen Scraping from a web page with a lot of Javascript
I just want to do tasks such as form entry and web scraping, but on sites that require javascript support. And I also need to enter forms, scrape, and so on in the same session. Ideally, I'd like a way to control a web browser from the command line. And...
I was wondering if there is anything similar to Mechanize or BeautifulSoup for PHP?
...
I am just trying to retrieve a web page, but somehow a foreign character is embedded in the HTML file. This character is not visible when I use "View Source."
isbn = 9780141187983
url = "http://search.barnesandnoble.com/booksearch/isbninquiry.asp?ean=%s" % isbn
opener = urllib2.build_opener()
url_opener = opener.open(url)
page = url_ope...
I'm converting some HTML parsing code from BeautifulSoup to lxml. I'm trying to figure out the lxml equivalent syntax for the following BeautifulSoup statement:
soup.find('a', {'class': ['current zzt', 'zzt']})
Basically I want to find all of the "a" tags in the document that have a class attribute of either "current zzt" or "zzt". ...
I'm trying to make a web scraper that will parse a web-page of publications and extract the authors. The skeletal structure of the web-page is the following:
<html>
<body>
<div id="container">
<div id="contents">
<table>
<tbody>
<tr>
<td class="author">####I want whatever is located here ###</td>
</tr>
</tbody>
</table>
</div>
</div>
</...
I am attempting to use BeautifulSoup to parse through a DOM tree and extract the names of authors. Below is a snippet of HTML to show the structure of the code I'm going to scrape.
<html>
<body>
<div class="list-authors">
<span class="descriptor">Authors:</span>
<a href="/find/astro-ph/1/au:+Lin_D/0/1/0/all/0/1">Dacheng Lin</a>,
<a h...
I currently have some Ruby code used to scrape some websites. I was using Ruby because at the time I was using Ruby on Rails for a site, and it just made sense.
Now I'm trying to port this over to Google App Engine, and keep getting stuck.
I've ported Python Mechanize to work with Google App Engine, but it doesn't support DOM inspecti...
This is the script I have:
import BeautifulSoup
if __name__ == "__main__":
    data = """
<root>
<obj id="3"/>
<obj id="5"/>
<obj id="3"/>
</root>
"""
    soup = BeautifulSoup.BeautifulStoneSoup(data)
    print soup
When run, this prints:
<root>
<obj id="3"></obj>
<obj id="5"></obj>
<obj id=...