The two posts below are great examples of different approaches to extracting data from websites and parsing it into R.
Scraping html tables into R data frames using the XML package
How can I use R (Rcurl/XML packages ?!) to scrape this webpage
I am very new to programming, and am just starting out with R, so I am hoping this questio...
Hi all,
Now I have a seemingly easy but challenging task. I need to develop a data set of questions, and I classify the questions into two categories:
Factoid questions: "Who is the current president of France?"
Free questions: "Can you rate the cameras below for me, please?"
Now I need to know the percentage of each category on Yaho...
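One way to start is with a crude heuristic classifier. The sketch below is an assumption on my part, not the asker's method: it treats questions that open with a wh-word as factoid and everything else as free-form, then computes the category percentages. The starter-word list is illustrative only.

```python
# Naive heuristic (an assumption, not the asker's method): questions that
# begin with a wh-word are treated as factoid, everything else as free-form.
FACTOID_STARTERS = ("who", "what", "when", "where", "which")

def classify(question):
    """Return 'factoid' or 'free' based on the question's first word."""
    words = question.strip().lower().split()
    first = words[0] if words else ""
    return "factoid" if first in FACTOID_STARTERS else "free"

def category_percentages(questions):
    """Percentage of each category across a list of questions."""
    total = len(questions)
    factoid = sum(1 for q in questions if classify(q) == "factoid")
    return {"factoid": 100.0 * factoid / total,
            "free": 100.0 * (total - factoid) / total}
```

A real data set would need hand labeling or a trained classifier; this only gives a first-pass split to eyeball the proportions.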
With this code I'm trying to download this web page: http://www.kayak.com/s/...
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,'http://www.kayak.com/s/search/air?ai=kayaksample&do=y&ft=ow&ns=n&cb=e&pa=1&l1=ZAG&t1=a&df=dmy&d1=4/10/2010&depart_flex=exact&r1=y&l2=LON&t2=a&d2=11/10/20...
I would like to crawl a website. The problem is that it is full of JavaScript elements, such as buttons, that when pressed do not change the URL, but do change the data on the page.
Usually I use LWP / Mechanize etc. to crawl sites, but neither supports JavaScript.
Any ideas?
...
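Since neither LWP nor Mechanize executes JavaScript, a common workaround is to watch the browser's network panel, find the HTTP request the button actually fires, and replay that request directly. A minimal sketch in Python (the asker uses Perl, but the idea is language-independent); the endpoint URL and parameters below are hypothetical placeholders:

```python
import json
import urllib.parse
import urllib.request

def build_data_request(base_url, params):
    """Build the request that the page's JavaScript fires when a button is
    pressed. base_url and params come from watching the browser's network
    panel; the values used here are hypothetical."""
    query = urllib.parse.urlencode(params)
    req = urllib.request.Request(base_url + "?" + query)
    # Some endpoints answer only requests that look like XHR calls.
    req.add_header("X-Requested-With", "XMLHttpRequest")
    return req

def fetch_json(req):
    """Fetch the request and decode the JSON the page would have rendered."""
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

If the endpoint returns HTML fragments instead of JSON, parse them with the same tools you would use on a full page.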
If I have an email address, such as [email protected], I can identify that it belongs to the Gmail webmail service from the gmail.com domain name. There are also googlemail.com addresses, which belong to the same service.
Is there a known list of domains belonging to popular email services?
E.g.
Hotmail (hotmail.com, live.com..)
G...
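Whatever the source of such a list, the lookup itself is a simple domain-to-service map. A minimal sketch; the entries below are illustrative, not exhaustive, and based only on the aliases commonly known (and the ones named in the question):

```python
# Illustrative (not exhaustive) mapping of domains to webmail services.
WEBMAIL_DOMAINS = {
    "gmail.com": "Gmail",
    "googlemail.com": "Gmail",
    "hotmail.com": "Hotmail",
    "live.com": "Hotmail",
    "yahoo.com": "Yahoo! Mail",
}

def provider_for(address):
    """Return the webmail service for an email address, or None if the
    domain is not a known webmail provider."""
    domain = address.rsplit("@", 1)[-1].lower()
    return WEBMAIL_DOMAINS.get(domain)
```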
I want to scrape groupon.com. My problem is that such sites, when you load them for the first time, ask you to join their email service, but when you reload the page they show you the content directly. How do I do it? I am using PHP for my scripting.
Also, if anyone could suggest a framework or library in PHP which makes scraping easy, it ...
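The "asks once, then shows the content" behaviour is almost always cookie-based: the first visit sets a cookie and later requests that carry it skip the sign-up page. So the scraper just needs to keep cookies between requests. A minimal sketch in Python (the asker uses PHP, where cURL's CURLOPT_COOKIEJAR / CURLOPT_COOKIEFILE options do the same job):

```python
import urllib.request
from http.cookiejar import CookieJar

def make_session():
    """Return an opener that stores cookies between requests, plus its jar.
    The second fetch of a page then carries whatever cookie the first
    visit set, so the site serves the real content."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar

# Hypothetical usage (URLs are placeholders):
# opener, jar = make_session()
# opener.open("http://www.groupon.com/")          # first visit sets the cookie
# html = opener.open("http://www.groupon.com/").read()  # now the content
```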
Hello!
I am scraping a website that uses JavaScript to dynamically populate the content of the page, using the Html Agility Pack.
Basically, I was searching for the XPath "//div[@class='PricingInfo']", but that div node was being written to the DOM via JavaScript.
So, when I load the page through the Html Agility Pack, the XPath mention...
Can you point me to an idea of how to get all the HTML files in a subfolder of a website, and in all the folders under it?
For example:
www.K.com/goo
I want all the HTML files that are in: www.K.com/goo/1.html, ......n.html
Also, if there are subfolders, I want to get those too: www.K.com/goo/foo/1.html...n.html
...
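Web servers do not generally expose folder listings, so the usual approach is to crawl: fetch a page, extract its links, keep the ones under the wanted prefix, and recurse. The sketch below shows only the per-page step (link extraction plus prefix filtering); a full crawler would loop this over a queue of unvisited URLs. The URLs in the usage example are hypothetical:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def html_files_under(page_url, html, prefix):
    """Return absolute links found in `html` that are .html files under
    `prefix`. A full crawler would fetch each result and recurse into
    subfolders; this is only the per-page extraction step."""
    parser = LinkCollector()
    parser.feed(html)
    absolute = (urljoin(page_url, href) for href in parser.links)
    return sorted(set(u for u in absolute
                      if u.startswith(prefix) and u.endswith(".html")))
```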
Can I get the correct data/text that is displayed via AJAX using Mechanize in Ruby?
Or is there any other scripting gem that would allow me to do so?
...
Is there any way I can scrape web pages that use AJAX,
using something like Ruby + Mechanize on a Linux server that doesn't have a monitor attached (linode.com, for example)?
http://watir.com/ would be a solution, but I guess it is not applicable to Linode.
...
Can I use Watir to scrape data from a website (AJAX used), but on a Linux server without a monitor (linode.com)?
...
Okay, so this is really messed up. I've set up a script to download an MP3 using urllib2 in Python.
import urllib2

url = 'http://example.com/song.mp3'  # placeholder; urllib2 needs a full URL with a scheme
req2 = urllib2.Request(url)
response = urllib2.urlopen(req2)
# grab the data
data = response.read()
mp3Name = "song.mp3"
song = open(mp3Name, "wb")  # "wb": MP3 data is binary, not text
song.write(data)
song.close()
Turns out it was somehow related to me do...
I know the question regarding PHP web-page scrapers has been asked time and time again, and using this, I discovered SimpleHTMLDOM. After it worked seamlessly on my local server, I uploaded everything to my online server, only to find out something wasn't working right. A quick look at the FAQ led me to this. I'm currently using a free hosting...
Hi,
I know scrapy.org, a fast, high-level screen-scraping and web-crawling framework used to crawl websites and extract structured data from their pages. I have used it in some projects and it is very simple to use. But it is written in Python.
My question is: are there similar frameworks for PHP?
...
I'm trying to scrape data from a website that has invalid HTML. Simple HTML DOM Parser parses it but loses some information because of how it handles the invalid HTML. The built-in DOM parser with DOMXPath isn't working; it returns a blank result set. I was able to get it (DOMDocument and DOMXPath) working locally after running the fetched...
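For comparison, here is how a deliberately tolerant event-driven parser copes with invalid markup. This is a Python sketch of the idea (the asker is in PHP, where `libxml_use_internal_errors(true)` before `DOMDocument::loadHTML()` is the usual way to suppress the invalid-HTML warnings); the class name "price" is a hypothetical example, and the parser ignores nesting of same-named tags:

```python
from html.parser import HTMLParser

class TextByClass(HTMLParser):
    """Pull the text of elements with a given class attribute, tolerating
    unclosed tags and other invalid markup that stricter DOM parsers
    choke on. Nested elements of the same tag are not handled."""
    def __init__(self, wanted_class):
        super().__init__()
        self.wanted = wanted_class
        self.tag = None
        self.capturing = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("class") == self.wanted:
            self.capturing = True
            self.tag = tag

    def handle_endtag(self, tag):
        if self.capturing and tag == self.tag:
            self.capturing = False

    def handle_data(self, data):
        if self.capturing and data.strip():
            self.texts.append(data.strip())
```

Because the parser is event-driven, an unclosed tag just means capture continues to end of input instead of producing an empty result set.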
Here (LINK) it is suggested that it is possible to "figure out what the JavaScript is doing and emulate it in your Python code". This is what I would like help doing, i.e. my question: how do I emulate javascript:__doPostBack?
Code from a website (full page source here: LINK):
<a style="color: Black;" href="javascript:__doPostBack('ctl00$Co...
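On ASP.NET pages, __doPostBack submits the page's form back to the same URL with the hidden fields (__VIEWSTATE, __EVENTVALIDATION, etc.) intact and __EVENTTARGET set to the control name passed as its first argument. So emulating it means scraping those hidden fields and POSTing them back. A minimal sketch; the hidden-field values and control name in the test are made up, and the naive regex assumes the `name` attribute precedes `value` as in typical ASP.NET output:

```python
import re
import urllib.parse

def postback_payload(html, event_target, event_argument=""):
    """Build the form data __doPostBack would submit: the page's __-prefixed
    hidden fields, plus __EVENTTARGET set to the control name passed to
    __doPostBack and __EVENTARGUMENT set to its second argument."""
    # Naive extraction; assumes name="..." appears before value="...".
    fields = dict(re.findall(
        r'<input[^>]+name="(__[A-Z]+)"[^>]+value="([^"]*)"', html))
    fields["__EVENTTARGET"] = event_target
    fields["__EVENTARGUMENT"] = event_argument
    return urllib.parse.urlencode(fields)
```

POST the resulting payload to the page's own URL (e.g. with `urllib.request.urlopen(page_url, payload.encode("ascii"))`); the response is the page state the link click would have produced.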
Hi,
I need to build a little Java tool that gets the keyword suggestions and traffic estimates from the Google keywords tool at https://adwords.google.com/select/KeywordToolExternal .
The page is rendered in JavaScript, so simple scraping isn't possible. I have tried HtmlUnit, but it doesn't work (tried different browser versions .. still no lu...
I need to retrieve text from a remote web site that does not provide an RSS feed.
What I know is that the data I need is always on pages linked to from the main page (http://www.example.com/) with a link that contains the text " Invoices Report ".
For example:
<a href="http://www.example.com/data/invoices/2010/10/invoices-report---tue...
Hi!
I have a small crawler/screen-scraping script that worked half a year ago, but now it doesn't work anymore. I checked the HTML and CSS values used in the regular expressions against the page source, and they are still the same, so from that point of view it should work. Any guesses?
require "open-uri"
# output file
f = open 'results.csv'...
Hi,
Which language is most suitable for working with cookies and inner page structure with AJAX?
thanks
...