This is a soup from a WordPress post detail page:
content = soup.body.find('div', id=re.compile('post'))
title = content.h2.extract()
item['title'] = unicode(title.string)
item['content'] = u''.join(map(unicode, content.contents))
I want to omit the enclosing div tag when assigning item['content']. Is there any way to render all the c...
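If the goal is the inner HTML without the wrapping div, one option (assuming BeautifulSoup 3, as in the snippet above) is renderContents(), which serialises only a tag's children:
item['content'] = content.renderContents().decode('utf8')
renderContents() returns a UTF-8 byte string by default, hence the decode() if a unicode object is needed.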
As a keen windsurfer, I'm interested in how windy the next few weeks are going to be. To that end, I've been writing a little app to scrape a popular weather site (personal use only - not relaying the information or anything) and collate the data into a single graph so that I can easily see when it's going to be worth heading out.
I have t...
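A minimal sketch of the scraping half (Python 2 era, to match the other snippets on this page); the URL and the 'wind' cell class are placeholders, since the real site's markup is unknown:
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen('http://weather.example.com/forecast')  # hypothetical URL
soup = BeautifulSoup(page.read())
wind_speeds = []
for cell in soup.findAll('td', {'class': 'wind'}):  # hypothetical class name
    # assumes the cell holds a bare number
    wind_speeds.append(float(''.join(cell.findAll(text=True))))
The collected numbers can then be handed to any plotting library to build the single combined graph.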
Hello
I have been asked to write an app which screen scrapes info from an intranet web page and presents certain info from it in a nice, easy-to-view format. The web page is a real mess and requires the user to click on half a dozen icons to discover if an ordered item has arrived or has been receipted. As you can imagine, users find...
I have a legacy VB application with data in its status bar that I want to use to drive a .NET application.
I have used spy++ to gain some insight into the window structure and have successfully used FindWindow and FindWindowEx to get handles to the StatusBarWndClass. Now I am struggling to get access to the actual data in the status bar. I...
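For reference, the usual Win32 recipe: because the status bar lives in the other process, the buffer passed with SB_GETTEXT must be allocated inside that process with VirtualAllocEx and read back with ReadProcessMemory. A sketch in Python/ctypes (window title and class are placeholders; use what Spy++ reports):
import ctypes
from ctypes import wintypes
user32 = ctypes.windll.user32
kernel32 = ctypes.windll.kernel32
# Pointer-sized arguments need explicit signatures on 64-bit Python.
user32.SendMessageW.argtypes = [wintypes.HWND, ctypes.c_uint, wintypes.WPARAM, wintypes.LPARAM]
kernel32.VirtualAllocEx.restype = ctypes.c_void_p
kernel32.VirtualFreeEx.argtypes = [ctypes.c_void_p, ctypes.c_void_p, ctypes.c_size_t, ctypes.c_uint]
kernel32.ReadProcessMemory.argtypes = [ctypes.c_void_p, ctypes.c_void_p, ctypes.c_void_p, ctypes.c_size_t, ctypes.c_void_p]
WM_USER = 0x0400
SB_GETTEXTLENGTHW = WM_USER + 12  # status bar messages from commctrl.h
SB_GETTEXTW = WM_USER + 13
PROCESS_VM_OPERATION, PROCESS_VM_READ = 0x0008, 0x0010
MEM_COMMIT, MEM_RELEASE, PAGE_READWRITE = 0x1000, 0x8000, 0x04

def read_status_bar(hwnd_status, part=0):
    # Allocate the text buffer inside the legacy app's process.
    pid = wintypes.DWORD()
    user32.GetWindowThreadProcessId(hwnd_status, ctypes.byref(pid))
    hproc = kernel32.OpenProcess(PROCESS_VM_OPERATION | PROCESS_VM_READ, False, pid.value)
    length = user32.SendMessageW(hwnd_status, SB_GETTEXTLENGTHW, part, 0) & 0xFFFF
    size = (length + 1) * 2  # UTF-16 code units
    remote = kernel32.VirtualAllocEx(hproc, None, size, MEM_COMMIT, PAGE_READWRITE)
    user32.SendMessageW(hwnd_status, SB_GETTEXTW, part, remote)
    buf = ctypes.create_unicode_buffer(length + 1)
    kernel32.ReadProcessMemory(hproc, remote, buf, size, None)
    kernel32.VirtualFreeEx(hproc, remote, 0, MEM_RELEASE)
    kernel32.CloseHandle(hproc)
    return buf.value

main = user32.FindWindowW(None, u'Legacy VB App')  # placeholder title
status = user32.FindWindowExW(main, None, u'StatusBarWndClass', None)
print read_status_bar(status)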
I need to migrate our website from a proprietary CMS that uses active server pages. Is there a tool or technique that will help download the resources from the existing site? I guess I'm looking for a tool that will crawl and scrape the entire site.
An additional challenge is that the site uses SSL and is protected with forms-based au...
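Dedicated mirroring tools (wget and the like) can crawl a whole site, but the forms login plus SSL usually means scripting it. A rough sketch with mechanize, which keeps the session cookie after the login; the URLs and field names are placeholders:
import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)  # intranet/legacy CMS may not serve robots.txt
br.open('https://old-cms.example.com/login.asp')  # hypothetical login page
br.select_form(nr=0)  # assuming the login form is the first on the page
br['username'] = 'me'  # field names are guesses
br['password'] = 'secret'
br.submit()  # mechanize carries the auth cookie from here on
seen, queue = set(), ['https://old-cms.example.com/']
while queue:
    url = queue.pop()
    if url in seen or 'old-cms.example.com' not in url:
        continue
    seen.add(url)
    html = br.open(url).read()
    open(url.split('/')[-1] or 'index.html', 'w').write(html)  # naive flat save
    queue.extend(link.absolute_url for link in br.links())  # assumes every page is HTML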
I've got an XHTML document, and I want to select the only table in it with class="index".
If I understand correctly, the descendant axis will select all nodes directly and indirectly descending from the current node, so here's what I've got.
//descendant::table[@class="index"]
It doesn't appear to be working when tested with xmlstar...
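One classic gotcha with XHTML: every element is in the XHTML namespace, so an unprefixed table in the XPath matches nothing. Newer xmlstarlet versions map the default namespace to the _ prefix; for illustration, the same fix in Python/lxml:
from lxml import etree
doc = etree.parse('page.xhtml')
ns = {'xh': 'http://www.w3.org/1999/xhtml'}
# Bind a prefix to the XHTML namespace and qualify the element name.
tables = doc.xpath('//xh:table[@class="index"]', namespaces=ns)
Also note //table[@class="index"] already walks all descendants; the explicit descendant:: axis is redundant.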
I want to pass the results of utidy to Beautiful Soup, à la:
page = urllib2.urlopen(url)
options = dict(output_xhtml=1,add_xml_decl=0,indent=1,tidy_mark=0)
cleaned_html = tidy.parseString(page.read(), **options)
soup = BeautifulSoup(cleaned_html)
When run, the following error results:
Traceback (most recent call last):
File "soup.py...
Would it be possible to write a screen-scraper for a website protected by a form login? I have access to the site, of course, but I have no idea how to log in to the site and save my credentials in C#.
Also, any good examples of screen scrapers in C# would be hugely appreciated.
Has this already been done?
...
So I am trying to extract the emails from my website using Ruby Mechanize and Hpricot.
What I am trying to do is loop over all the pages of my administration side and parse the pages with Hpricot. So far so good. Then I get:
Exception `Net::HTTPBadResponse' at /usr/lib/ruby/1.8/net/http.rb:2022 - wrong status line: *SOME HTML CODE HERE*
wh...
I am looking for an application that can do the following:
a) Programmatically auto-login to a page (login.aspx) using HttpWebResponse, with an already-specified username and password.
b) Detect the redirect URL if the login is successful.
c) Submit another form (settings.aspx) to update certain fields in the database.
The required co...
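A sketch of steps a)-c) in Python's requests library, for illustration; in .NET the analogous trick is setting AllowAutoRedirect = false on the HttpWebRequest so the Location header stays visible. URLs and field names are placeholders:
import requests
session = requests.Session()
resp = session.post('https://example.com/login.aspx',
                    data={'username': 'me', 'password': 'secret'},
                    allow_redirects=False)  # a) log in without following redirects
redirect_url = resp.headers.get('Location')  # b) where a successful login points
session.post('https://example.com/settings.aspx',  # c) reuse the authenticated session
             data={'somefield': 'new value'})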
Hi!
I'd like to use the Adblock Plus subscriptions to remove ads from the pages I'm about to scrape. Has anyone used this approach? What is the performance of such a solution? What is the algorithm used by the extension itself?
...
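The extension matches each request URL (plus context such as whether it is a script or third-party) against the filter rules, compiling the simple patterns into regex-like matchers. One way to reuse the same subscriptions outside the browser, assuming Python and the third-party adblockparser package:
from adblockparser import AdblockRules
raw_rules = open('easylist.txt').read().splitlines()  # a downloaded subscription
rules = AdblockRules(raw_rules)
if rules.should_block('http://ads.example.com/banner.js',
                      {'script': True, 'third-party': True}):
    pass  # skip fetching / strip this resource before scraping
Performance depends heavily on the list size; big subscriptions are noticeably faster with an optimised regex engine behind the matcher.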
I'm trying to debug a DOM scraping package called Crowbar. Anyhow, when I run it I get:
Error: [Exception... "Component returned failure code: 0xc1f30001 (NS_ERROR_NOT_INITIALIZED) [nsIServerSocket.asyncListen]" nsresult: "0xc1f30001 (NS_ERROR_NOT_INITIALIZED)" location: "JS frame :: chrome://crowbar/content/crowbar.js :: onLoad :: l...
import mechanize
url = 'http://steamcommunity.com'
br=mechanize.Browser(factory=mechanize.RobustFactory())
br.open(url)
print br.request
print br.form
for each in br.forms():
    print each
    print
The above code results in:
Traceback (most recent call last):
File "./mech_test.py", line 12, in <module>
for each in br.forms(...
I am trying to scrape rows off of over 1200 .htm files that are on my hard drive. On my computer they are here 'file:///home/phi/Data/NHL/pl07-08/PL020001.HTM'. These .htm files are sequential from *20001.htm until *21230.htm. My plan is to eventually toss my data in MySQL or SQLite via a spreadsheet app or just straight in if I can get ...
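A minimal sketch of that pipeline going straight into SQLite (Python 2 / BeautifulSoup 3 to match the rest of this page; the one-column schema and the joined row format are placeholders for whatever the .htm tables actually hold):
import glob
import sqlite3
from BeautifulSoup import BeautifulSoup
conn = sqlite3.connect('nhl.db')
conn.execute('CREATE TABLE IF NOT EXISTS plays (game TEXT, row TEXT)')
for path in sorted(glob.glob('/home/phi/Data/NHL/pl07-08/PL0*.HTM')):
    soup = BeautifulSoup(open(path).read())
    for tr in soup.findAll('tr'):
        cells = [''.join(td.findAll(text=True)).strip() for td in tr.findAll('td')]
        if cells:
            conn.execute('INSERT INTO plays VALUES (?, ?)', (path, '|'.join(cells)))
conn.commit()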
I'm trying to set up a page that (if it were part of Stack Overflow) would generate a Stack Overflow Flair Blogger Gadget.
...
I am wondering if there are any techniques to identify a web crawler that collects information for illegal use. Plainly speaking, data theft to create carbon copies of a site.
Ideally, this system would detect a crawling pattern from an unknown source (if not on the list with the Google crawler, etc), and send bogus information to the ...
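One common building block is a honeypot URL: disallow a path in robots.txt and hide links to it from humans, so any client that fetches it anyway gets flagged. A toy sketch (Python/Flask, all routes hypothetical):
from flask import Flask, request
app = Flask(__name__)
FLAGGED = set()

@app.route('/robots.txt')
def robots():
    return 'User-agent: *\nDisallow: /trap/\n'

@app.route('/trap/<path:anything>')
def trap(anything):
    FLAGGED.add(request.remote_addr)  # well-behaved bots never get here
    return 'nothing to see here'

@app.route('/article')
def article():
    if request.remote_addr in FLAGGED:
        return 'bogus content for suspected scrapers'
    return 'real content'
Request-rate and access-pattern heuristics can feed the same FLAGGED set.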
Hi -
You probably know that IE has this thing where you can save a web page, and it will automatically download the HTML file and all the image/css/js files that the HTML file uses.
Now there is one problem with this- the links in the html file are not changed.
So if I download the HTML page of example.com, which has an <a href=/hi.ht...
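The missing step can be scripted: parse the saved file and rebase every relative link against the original site. A sketch with Python 2 / BeautifulSoup 3 (file names are placeholders; img src and link href would want the same treatment):
import urlparse
from BeautifulSoup import BeautifulSoup
base = 'http://example.com/'
soup = BeautifulSoup(open('saved_page.html').read())
for a in soup.findAll('a', href=True):
    a['href'] = urlparse.urljoin(base, a['href'])  # /hi.html -> http://example.com/hi.html
open('fixed_page.html', 'w').write(str(soup))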
What solutions exist for screen scraping a site over SSL for use with .NET?
My use case is that I need to login to a partner website (https), navigate through a dynamic hierarchy, and download a zipped file of reports.
I certainly could use other screen scrapers if there are no good viable options in .NET, either though the framework o...
Hey
I'm currently working on posting a file from a C# application to an image host (KalleLoad.net - with the owners consent, obviously).
I've gotten the actual posting of the request to work, but it's not returning what I expected. The owner of the upload site has provided me with an API (of sorts) which will return some XML with the ...
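For comparison, the whole flow is compact in Python's requests library (endpoint, field names, and API key are made up; the real ones come from the site owner's API notes):
import requests
import xml.etree.ElementTree as ET
with open('picture.png', 'rb') as f:
    resp = requests.post('http://kalleload.net/api/upload',  # hypothetical endpoint
                         files={'file': f},
                         data={'key': 'YOUR_API_KEY'})
root = ET.fromstring(resp.content)  # the site answers with an XML document
When the response is not what you expect, dumping the raw response body usually shows whether the server returned an error page instead of the XML.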
I have a personal web site that crawls and collects MP3s from my favorite music blogs for later listening...
The way it works is a cron job runs a .php script once every minute that crawls the next blog in the DB. The results are put into the DB, and then a second .php script crawls the collected links.
The scripts only crawl two levels...
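A condensed sketch of that two-level crawl (Python rather than PHP, purely for illustration; the blog URL stands in for "the next blog in the DB"):
import re
import urllib2
import urlparse

def links(url):
    html = urllib2.urlopen(url).read()
    return [urlparse.urljoin(url, href) for href in re.findall(r'href="([^"]+)"', html)]

mp3s = []
for post_url in links('http://music-blog.example.com/'):  # level one: post links
    mp3s += [u for u in links(post_url) if u.endswith('.mp3')]  # level two: MP3s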