screen-scraping

How to render contents of a tag in unicode in BeautifulSoup?

This is a soup from a WordPress post detail page: content = soup.body.find('div', id=re.compile('post')) title = content.h2.extract() item['title'] = unicode(title.string) item['content'] = u''.join(map(unicode, content.contents)) I want to omit the enclosing div tag when assigning item['content']. Is there any way to render all the c...

Javascript graphing library to draw a region

As a keen windsurfer, I'm interested in how windy the next few weeks are going to be. To that end, I've been writing a little app to scrape a popular weather site (personal use only - not relaying the information or anything) and collate the data into a single graph so that I can easily see when's going to be worth heading out. I have t...

Screen Scraping from a web page with a lot of Javascript

Hello I have been asked to write an app which screen scrapes info from an intranet web page and presents the certain info from it in a nice easy to view format. The web page is a real mess and requires the user to click on half a dozen icons to discover if an ordered item has arrived or has been receipted. As you can imagine users find...

Is it possible to get cell text from a remote applications StatusBarWndClass?

I have a legacy vb application that has data in a status bar I want to use to drive a .NET application. I have used spy++ to gain some insight into the window structure and have successfully used FindWindow and FindWindowEx to get handles to the StatusBarWndClass. Now I am struggling to get access to the actual data in the status bar. I...

How to migrate resources from proprietary CMS?

I need to migrate our website from a proprietary CMS that uses active server pages. Is there a tool or technique that will help download the resources from the existing site? I guess I'm looking for a tool that will crawl and scrape the entire site. An additional challenge is that the site uses SSL and is protected with forms-based au...

Selecting a specific table with XPath

I've got an XHTML document, and I want to select the only table in it with class="index". If I understand correctly, the descendant axis will select all nodes directly and indirectly descending from the current node, so here's what I've got. //descendant::table[@class="index"] It doesn't appear to be working when tested with xmlstar...

Beautiful Soup and uTidy

I want to pass the results of utidy to Beautiful Soup, ala: page = urllib2.urlopen(url) options = dict(output_xhtml=1,add_xml_decl=0,indent=1,tidy_mark=0) cleaned_html = tidy.parseString(page.read(), **options) soup = BeautifulSoup(cleaned_html) When run, the following error results: Traceback (most recent call last): File "soup.py...

Screen-scraping a site with a asp.net form login in C#?

Would it be possible to write a screen-scraper for a website protected by a form login. I have access to the site, of course, but I have no idea how to login to the site and save my credentials in C#. Also, any good examples of screenscrapers in C# would be hugely appreciated. Has this already been done? ...

mechanize html scraping problem

so i am trying to extract the email of my website using ruby mechanize and hpricot. what i a trying to do its loop on all the page of my administration side and parse the pages with hpricot.so far so good. Then I get: Exception `Net::HTTPBadResponse' at /usr/lib/ruby/1.8/net/http.rb:2022 - wrong status line: *SOME HTML CODE HERE* wh...

How to submit a form automatically using HttpWebResponse

I am looking for an application that can do the following a) Programmatically auto login to a page(login.asxp) using HttpWebResponse by using already specified username and password. b) Detect the redirect URL if the login is successful. c) Submit another form (settings.aspx) to update certain fields in the database. The required co...

Using Adblock Plus subscriptions to remove ads from downloaded pages

Hi! I'd like to use the adblosck plus subscriptions to remove ads from the pages I'm about to scrap. Have anyone used such approach? What is the performance of such solution? What is the algorithm used by the extension itself? ...

javascript server under XULRunner fails.

I'm trying to debug a DOM scraping packaged called crowbar. Anyhow, when I run I get: Error: [Exception... "Component returned failure code: 0xc1f30001 (NS_ERROR_NOT_INITIALIZED) [nsIServerSocket.asyncListen]" nsresult: "0xc1f30001 (NS_ERROR_NOT_INITIALIZED)" location: "JS frame :: chrome://crowbar/content/crowbar.js :: onLoad :: l...

How to make mechanize not fail with forms on this page?

import mechanize url = 'http://steamcommunity.com' br=mechanize.Browser(factory=mechanize.RobustFactory()) br.open(url) print br.request print br.form for each in br.forms(): print each print The above code results in: Traceback (most recent call last): File "./mech_test.py", line 12, in <module> for each in br.forms(...

Scraping Multiple html files to CSV

I am trying to scrape rows off of over 1200 .htm files that are on my hard drive. On my computer they are here 'file:///home/phi/Data/NHL/pl07-08/PL020001.HTM'. These .htm files are sequential from *20001.htm until *21230.htm. My plan is to eventually toss my data in MySQL or SQLite via a spreadsheet app or just straight in if I can get ...

How do I get the user number from inside a stackoverflow page using javascript?

I'm trying to set up a page that (if it were part of stack overflow) would generates a Stackoverflow Flair Blogger Gadget. ...

Identifying hostile web crawlers

I am wondering if there are any techniques to identify a web crawler that collects information for illegal use. Plainly speaking, data theft to create carbon copies of a site. Ideally, this system would detect a crawling pattern from an unknown source (if not on the list with the Google crawler, etc), and send bogus information to the ...

Save an html page + change all links to point to the right place

Hi - You probably know that IE has this thing where you can save a web page, and it will automatically download the html file and all he image/css/js files that the html file uses. Now there is one problem with this- the links in the html file are not changed. So if I download the html page of example.com, which has an < a href=/hi.ht...

Screen scraping over SSL with .NET

What solutions exist for screen scraping a site over SSL for use with .NET? My use case is that I need to login to a partner website (https), navigate through a dynamic hierarchy, and download a zipped file of reports. I certainly could use other screen scrapers if there are no good viable options in .NET, either though the framework o...

HTTP web request not returning what is expected in C#

Hey I'm currently working on posting a file from a C# application to an image host (KalleLoad.net - with the owners consent, obviously). I've gotten the actual posting of the request to work, but it's not returning what I expected. The owner of the upload site has provided me with an API (of sorts) which will return some XML with the ...

crawling scraping and threading? with php

I have a personal web site that crawls and collects MP3s from my favorite music blogs for later listening... The way it works is a CRON job runs a .php scrip once every minute that crawls the next blog in the DB. The results are put into the DB and then a second .php script crawls the collected links. The scripts only crawl two levels...