webscraping

Scraping *.aspx content using Python

I'm having difficulty scraping a dynamically generated table in ASPX. I'm trying to scrape the gas prices from a site like this GasPrices. I can extract all the information in the gas price table (address, time submitted, etc.), except for the actual gas price. Is there a way I could scrape the gas prices? i.e. somehow get a text representa...

Facebook fan page photo scraping

Hi, we want to add a Facebook fan page photo competition to our fan page. The idea is that people can upload photos and others can like them. The person with the most likes on their photo wins a prize. Now I was wondering if anyone knows a good way to get a snapshot of all the photos at a given moment. So that when we want to s...

What's the fastest way to scrape a lot of pages in PHP?

I have a data aggregator that relies on scraping several sites and indexing their information in a way that is searchable to the user. I need to be able to scrape a vast number of pages daily, and I have run into problems using simple cURL requests, which are fairly slow when executed in rapid sequence for a long time (the scraper runs...

Programmatically Submit form and loop through paging (C#.NET)

I need to write a custom web scraper to mine some data. I know how to submit a form using the HttpWebRequest class's POST method. My challenge is to loop through the resulting pages and retrieve the records from each page. Does anyone have a code sample or article to point to? Thanks ...

Web scraping advice/help with Java for Android app!

Hey there, I've heard about web scraping software that can take data from a webpage. I'm building an Android app and I want to take information from this site: www.menupages.ie. All I need is the names of the restaurants, and typing them in myself would be very tedious. Can someone tell me how I'd go about doing this in Eclipse, what m...

Writing a program to scrape forums

Hi, I need to write a program to scrape forums. Should I write the program in Python using the Scrapy framework, or should I use PHP cURL? Also, is there a PHP equivalent to Scrapy? Thanks ...

How to isolate a single element from a scraped web page in R

Hello, I'm trying to do someone a favour, and it's a tad outside my comfort zone, so I'm stuck. I want to use R to scrape this page (http://www.fifa.com/worldcup/archive/germany2006/results/matches/match=97410001/report.html ) and others, to get the goal scorers and times. So far, this is what I've got: require(RCurl) require(XML) th...

Problem pulling data from website in .NET and C#

I have written a web scraping program to go to a list of pages and write all the HTML to a file. The problem is that when I pull a block of text, some of the characters get written as '�'. How do I pull those characters into my text file? Here is my code: string baseUri = String.Format("http://www.rogersmushrooms.com/gallery/loadimage...

Simulate Browser Resources Expansion Behavior With Python

I'm looking for a way to simulate browser resources expansion behavior. The flow I'm trying to address is the following:

1. Access an initial URL (e.g. http://example.dmn/index.htm)
2. Parse the HTML response received (e.g. index.htm)
3. Find the resources that a browser will fetch as a result of the index parsing, e.g.: Images, Flash ...
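
The flow described in this question can be sketched with the Python standard library alone. This is a minimal illustration, not a full browser model: it assumes Python 3, parses a hard-coded HTML string instead of a real HTTP response, and only looks at a handful of tags. The class name `ResourceCollector`, the `RESOURCE_ATTRS` table, and the sample markup are all hypothetical; resources referenced from CSS or loaded by scripts are not covered.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical table: tag name -> attribute that names a resource to fetch.
RESOURCE_ATTRS = {
    "img": "src",
    "script": "src",
    "embed": "src",   # e.g. Flash movies
    "iframe": "src",
    "link": "href",
}


class ResourceCollector(HTMLParser):
    """Collects absolute URLs of resources referenced by a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attr_name = RESOURCE_ATTRS.get(tag)
        if attr_name is None:
            return
        attrs = dict(attrs)
        # Assumption: only stylesheet <link> elements count as fetched resources.
        if tag == "link" and "stylesheet" not in (attrs.get("rel") or "").lower():
            return
        value = attrs.get(attr_name)
        if value:
            # Resolve relative references against the page URL.
            self.resources.append(urljoin(self.base_url, value))


def collect_resources(html, base_url):
    parser = ResourceCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.resources


SAMPLE_HTML = """
<html><head>
<link rel="stylesheet" href="style.css">
<link rel="alternate" href="feed.xml">
<script src="/js/app.js"></script>
</head><body>
<img src="images/logo.png">
<embed src="movie.swf">
<a href="other.htm">a plain link, not fetched automatically</a>
</body></html>
"""

resources = collect_resources(SAMPLE_HTML, "http://example.dmn/index.htm")
for url in resources:
    print(url)
```

Each collected URL could then be requested in turn; the plain `<a>` link and the non-stylesheet `<link>` are deliberately skipped.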

How to scrape data from LocService (http://www.trackdroid.org/locservice.html) using PHP

Hi all. I'm looking to scrape geolocation data from LocService (a solution to track GPS pings from an Android phone) and host it in a MySQL database as a PHP cron job. The login system uses HTTPS. I'm having trouble returning anything through cURL. Has anyone got any ideas? Gausie ...

Retrieving a lot of URL addresses

Dear Coding Experts, Edit: Just for clarification, I am using Python, and would like to do this within Python. I am in the middle of collecting data for a research project at our university. Basically I need to scrape a lot of information from a website that monitors the European Parliament. Here is an example of how the url of one site...

Cleaning up and removing tags with BeautifulSoup

Hey again all, I have the following script so far:

    from mechanize import Browser
    from BeautifulSoup import BeautifulSoup
    import re
    import urllib2

    br = Browser()
    br.open("http://www.foo.com")
    html = br.response().read()
    soup = BeautifulSoup(html)
    items = soup.findAll(id="info")

and it runs perfectly, and results in the following ...

Escaping … with BeautifulSoup

I am currently using BeautifulSoup to scrape some websites; however, I have a problem with some specific characters. The code inside UnicodeDammit seems to indicate these (again) are some Microsoft-invented ones. I'm using the newest version of BeautifulSoup (3.0.8.1) as I am still using Python 2.5. The following code illustrates my proble...
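
As background for the "Microsoft-invented" characters mentioned here, the following is a small Python 3 sketch of how the byte 0x85 decodes under Windows-1252 versus Latin-1. It is only an illustration under the assumption that the page in question is served as Windows-1252; the sample bytes are made up, and it does not reproduce the asker's truncated code or BeautifulSoup's own behaviour.

```python
# Hypothetical raw bytes as they might arrive from a Windows-1252 page.
raw = b"Read more\x85"

# Windows-1252 maps byte 0x85 to U+2026 (horizontal ellipsis).
as_cp1252 = raw.decode("cp1252")

# Latin-1 maps the same byte to U+0085, an invisible control character.
as_latin1 = raw.decode("latin-1")

print(repr(as_cp1252))
print(repr(as_latin1))
```

If the declared or guessed encoding is Latin-1 while the bytes are really Windows-1252, the ellipsis is lost in exactly this way.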

Problem with scraping data using BeautifulSoup

Dear Python Experts, I have written the following trial code to retrieve the title of legislative acts from the European Parliament. import urllib2 from BeautifulSoup import BeautifulSoup search_url = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-2010-%.4d&language=EN" for number in xran...

New GWT interface automation testing

So our front-end GUI is getting a large overhaul to a new GWT-based application. I have been working on creating the automation scripts for the old front end using cURL in some Tcl/Expect scripts. As I have been looking at the new app, I am starting to realize more and more that cURL is out of the question for performing these web interactions...

Web scraping in text mode

Hi All, I am currently using Watir to scrape a website that hides all its data from the usual HTML source. If I am not wrong, they are using XML and AJAX technology to hide it. Firefox can see it, but it is displayed via "DOM Source of selection". Everything works fine, but now I am looking for an equivalent tool to Watir but...

Determining number of sites on a website in Python

Dear Pythonistas, I have the following link: http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-2010-0001&language=EN The reference part of the URL has the following information: A7 == the parliament (the current one is the seventh parliament, the former is A6, and so forth) 2010 == year 0001 == docu...
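
The reference layout described in this question can be sketched as a small URL builder. This assumes the three parts are joined with hyphens and that the last part is a number zero-padded to four digits, as in the example link; the names `URL_TEMPLATE` and `build_report_url` and their parameters are hypothetical, and no request is made to the site.

```python
# Template copied from the example link, with the reference parts parameterised.
URL_TEMPLATE = (
    "http://www.europarl.europa.eu/sides/getDoc.do"
    "?type=REPORT&mode=XML"
    "&reference=A{parliament}-{year}-{number:04d}&language=EN"
)


def build_report_url(parliament, year, number):
    """Build one URL from the parliament number, year, and running number."""
    return URL_TEMPLATE.format(parliament=parliament, year=year, number=number)


first_url = build_report_url(7, 2010, 1)
print(first_url)

# A few consecutive candidates; whether each one exists must be checked separately.
candidates = [build_report_url(7, 2010, n) for n in range(1, 4)]
```

Counting how many of the generated URLs actually exist would still require fetching each one and inspecting the response.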

Android/Java: Simulate a click on this webpage.

Hello all, Last year I made an Android application that scraped the information on my train company in Belgium (the application is BETrains: http://www.cyrket.com/p/android/tof.cv.mpp/). This application was really cool and allowed users to talk with other people on the train (a messaging server is run by me) and the conversations wre...

Error with cURL - "Could not resolve host: www.bbb.org(; No data record of requested type"

I am trying to access data from http://www.bbb.org/us/Find-Business-Reviews/ with cURL. I used HTTPFox to see what data this site sends and accordingly made an array to "POST" to the page. But I am having a problem accessing pages 2, 3, 4, 5... Here is the array - $array = Array(); $array['__EVENTTARGET'] = 'ctl12$gc1$s$gridResult...

Is there any 'virtual browser' in PHP?

Hi, I want to extract data from a website, but it uses some strange JavaScript, so I can't get the job done with cURL. I want to know whether there is anything like a virtual browser that opens up the page and lets me initiate clicks on some buttons. If not, is there any executable program to achieve this task via the command line? ...