webscraping

Extract Links from Webpage using R

The two posts below are great examples of different approaches to extracting data from websites and parsing the results into R: "Scraping html tables into R data frames using the XML package" and "How can I use R (RCurl/XML packages?!) to scrape this webpage". I am very new to programming, and am just starting out with R, so I am hoping this questio...
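
One way to see the shape of the answer (a minimal sketch; the question asks for R, where something like the XML package's htmlParse() plus xpathSApply(doc, "//a/@href") does the job, but the link-extraction logic itself is language-agnostic) is this standard-library Python version; the URL is a placeholder:

    from html.parser import HTMLParser
    from urllib.request import urlopen

    class LinkCollector(HTMLParser):
        """Collect the href attribute of every <a> tag."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(v for k, v in attrs if k == "href" and v)

    html = urlopen("http://example.com/").read().decode("utf-8", "replace")
    collector = LinkCollector()
    collector.feed(html)
    print(collector.links)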

A question on classifying question categories on Yahoo! Answer

Hi all, now I have a seemingly easy but challenging task. I need to develop a data set of questions, and I classify the questions into two categories: factoid questions ("Who is the current president of France?") and free questions ("Can you rate the cameras below for me, please?"). Now I need to know the percentage of both categories on Yaho...
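
A common baseline for this split, sketched below in Python, is a keyword heuristic: questions opening with an interrogative (who/what/when/where/which/how many...) are factoid candidates, everything else is free-form. The starter list is an assumption, not a validated model:

    FACTOID_STARTERS = ("who", "what", "when", "where", "which",
                        "how many", "how much")

    def classify(question):
        # Crude heuristic: factoid questions tend to open with a wh-word.
        q = question.strip().lower()
        return "factoid" if q.startswith(FACTOID_STARTERS) else "free"

    samples = ["Who is the current president of France?",
               "Can you rate the cameras below for me, please?"]
    for q in samples:
        print(q, "->", classify(q))

Running the heuristic over a scraped sample and counting the two labels gives the percentage estimate the question asks for.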

curl problem, can't download full web page

With this code I'm trying to download this web page: http://www.kayak.com/s/...

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, 'http://www.kayak.com/s/search/air?ai=kayaksample&do=y&ft=ow&ns=n&cb=e&pa=1&l1=ZAG&t1=a&df=dmy&d1=4/10/2010&depart_flex=exact&r1=y&l2=LON&t2=a&d2=11/10/20...
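
A frequent cause of "partial page" symptoms is that the site serves different or incomplete content to clients without a browser-like User-Agent, or builds the results with follow-up requests after the initial load; in PHP cURL terms, the usual first steps are setting CURLOPT_RETURNTRANSFER, CURLOPT_FOLLOWLOCATION, and CURLOPT_USERAGENT. A minimal Python sketch of the same idea, with a placeholder URL and made-up agent string:

    import urllib.request

    # Sending a browser-like User-Agent is often the missing piece.
    req = urllib.request.Request(
        "http://www.kayak.com/",
        headers={"User-Agent": "Mozilla/5.0 (compatible; example-scraper)"},
    )
    html = urllib.request.urlopen(req, timeout=30).read()
    print(len(html), "bytes")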

How can I handle Javascript in a Perl web crawler?

I would like to crawl a website; the problem is that it's full of JavaScript things, such as buttons that, when pressed, do not change the URL, but change the data on the page. Usually I use LWP / Mechanize etc. to crawl sites, but neither supports JavaScript. Any ideas? ...
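
LWP and WWW::Mechanize only fetch and parse HTML; neither runs JavaScript. Two usual routes are driving a real browser (Selenium and friends) or, often simpler, opening the browser's network inspector, finding the request the button's JavaScript actually makes, and calling that endpoint directly. A Python sketch of the second route, with a hypothetical JSON endpoint:

    import json
    import urllib.request

    # Hypothetical endpoint: substitute the URL the page's JavaScript
    # requests, as seen in the browser's network inspector.
    url = "http://example.com/api/data?page=2"
    with urllib.request.urlopen(url, timeout=30) as resp:
        payload = json.load(resp)
    print(payload)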

How do I identify a webmail service from an email address?

If I have an email address, such as user@gmail.com, I can identify that it belongs to the Gmail webmail service from the gmail.com domain name. There are also googlemail.com addresses which belong to the same service. Is there a known list of domains belonging to popular email services? E.g. Hotmail (hotmail.com, live.com..) G...
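
There is no authoritative registry of webmail domains, so in practice this is a curated lookup table. A minimal Python sketch with an assumed, abbreviated mapping:

    # Abbreviated, hand-maintained mapping; extend as services are found.
    WEBMAIL_DOMAINS = {
        "gmail.com": "Google Mail",
        "googlemail.com": "Google Mail",
        "hotmail.com": "Microsoft (Hotmail/Live)",
        "live.com": "Microsoft (Hotmail/Live)",
        "yahoo.com": "Yahoo! Mail",
    }

    def webmail_service(address):
        domain = address.rsplit("@", 1)[-1].lower()
        return WEBMAIL_DOMAINS.get(domain)  # None = unknown / not webmail

    print(webmail_service("user@googlemail.com"))  # Google Mail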

web scraping groupon

I want to scrape groupon.com. Now, my problem is that such sites, when you load them for the first time, ask you to join their email service, but when you reload the page they show you the content of the page directly. How do I do it? I am using PHP for my scripting. Also, if anyone could suggest a framework or library in PHP which makes scraping easy it ...
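
Interstitials like this are normally controlled by a cookie set on the first visit, so keeping a cookie jar across requests makes the second request look like a reload; in PHP cURL that is CURLOPT_COOKIEJAR / CURLOPT_COOKIEFILE. A Python sketch of the idea, with a placeholder URL:

    import http.cookiejar
    import urllib.request

    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))

    # First request: the site sets its "already saw the signup page" cookie.
    opener.open("http://example.com/", timeout=30)
    # Second request reuses the cookie, so the real content is served.
    html = opener.open("http://example.com/", timeout=30).read()
    print(len(html), "bytes")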

Html Agility Pack: Find Comment Node

Hello! I am scraping a website that uses JavaScript to dynamically populate its content, using the Html Agility Pack. Basically, I was searching for the XPath "//div[@class='PricingInfo']", but that div node was being written to the DOM via JavaScript. So, when I load the page through the Html Agility Pack, the XPath mention...
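
When a value is written to the DOM by JavaScript, it often still appears in the raw HTML inside a <script> block or an HTML comment, so one approach is to search comment nodes (in Html Agility Pack, an XPath like "//comment()" selects them) and parse the payload out. A Python sketch of the comment-scanning step, using only the standard library:

    from html.parser import HTMLParser

    class CommentFinder(HTMLParser):
        """Collect the text of every <!-- comment --> node."""
        def __init__(self):
            super().__init__()
            self.comments = []

        def handle_comment(self, data):
            self.comments.append(data)

    finder = CommentFinder()
    finder.feed("<div><!-- PricingInfo payload --><p>visible</p></div>")
    print(finder.comments)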

How can I get all HTML pages from a website subfolder with Perl?

Can you point me to an idea of how to get all the HTML files in a subfolder of a website, and in all the folders under it? For example, given www.K.com/goo, I want all the HTML files that are in www.K.com/goo/1.html, ..., n.html. Also, if there are subfolders, I want to get them too: www.K.com/goo/foo/1.html...n.html ...
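
Since a web server does not expose its folders directly, the usual approach is a small crawler that starts at the subfolder's index, follows links, and keeps only URLs under the wanted prefix. A standard-library Python sketch (the question asks for Perl, where WWW::Mechanize plus the same queue logic applies); www.K.com is the question's placeholder host:

    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(v for k, v in attrs if k == "href" and v)

    def crawl(start, prefix):
        """Breadth-first crawl, keeping only URLs under the given prefix."""
        seen, queue, pages = set(), [start], []
        while queue:
            url = queue.pop(0)
            if url in seen or not url.startswith(prefix):
                continue
            seen.add(url)
            try:
                html = urlopen(url, timeout=30).read().decode("utf-8", "replace")
            except OSError:
                continue  # skip unreachable pages
            pages.append(url)
            collector = LinkCollector()
            collector.feed(html)
            queue.extend(urljoin(url, link) for link in collector.links)
        return pages

    print(crawl("http://www.K.com/goo/", "http://www.K.com/goo/"))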

can mechanize read ajax? (ruby)

Can I get the correct data/text that is displayed via AJAX using Mechanize in Ruby? Or is there any other scripting gem that would allow me to do so? ...

Can any scripting language read AJAX/JavaScript? (Linux)

Is there any way I can scrape web pages that use AJAX, using something like Ruby + Mechanize on a Linux server that doesn't have a monitor attached (linode.com, for example)? http://watir.com/ would be a solution, but I guess it is not applicable to Linode. ...

Can I use Watir to scrape data from a website on a linux server without monitor?

Can I use Watir to scrape data from a website (AJAX used), but on a Linux server without a monitor (linode.com)? ...
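
This and the two questions above circle the same point: Mechanize (Ruby or Python) never executes JavaScript, so AJAX-built content is invisible to it, while browser-driving tools like Watir do work on a monitor-less box if you give them a virtual display (Xvfb) or run the browser headless. A Python sketch with Selenium and headless Firefox, assuming the selenium package and geckodriver are installed (APIs vary by Selenium version):

    from selenium import webdriver

    # Firefox runs headless, so no X display or monitor is needed.
    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")
    driver = webdriver.Firefox(options=options)
    try:
        driver.get("http://example.com/")
        # page_source reflects the DOM after JavaScript has run.
        print(driver.page_source[:200])
    finally:
        driver.quit()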

Downloading an MP3 using Python on Windows mangles the song, but on Linux it doesn't?

Okay, so this is really messed up. I've set up a script to download an MP3 using urllib2 in Python.

    url = 'example.com'
    req2 = urllib2.Request(url)
    response = urllib2.urlopen(req2)
    # grab the data
    data = response.read()
    mp3Name = "song.mp3"
    song = open(mp3Name, "w")
    song.write(data2)
    song.close()

Turns out it was somehow related to me do...
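
The classic cause of exactly this symptom is the file mode: on Windows, "w" opens the file in text mode and translates \n bytes to \r\n, corrupting binary data such as an MP3, while Linux performs no such translation. (The snippet also writes data2 where data was read.) A corrected sketch, keeping the question's urllib2 / Python 2 style and a placeholder URL:

    import urllib2

    url = 'http://example.com/song.mp3'  # placeholder
    response = urllib2.urlopen(urllib2.Request(url))
    data = response.read()

    # "wb" = write binary: no newline translation on Windows.
    song = open("song.mp3", "wb")
    song.write(data)
    song.close()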

How to scrape websites when cURL and allow_url_fopen is disabled

I know the question regarding PHP web page scrapers has been asked time and time again, and using this, I discovered SimpleHTMLDOM. After working seamlessly on my local server, I uploaded everything to my online server only to find out something wasn't working right. A quick look at the FAQ led me to this. I'm currently using a free hosting...
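
When both cURL and allow_url_fopen are off, PHP scrapers typically fall back to raw sockets: fsockopen plus a hand-written HTTP request. The same idea in Python, a minimal HTTP/1.0 GET over a plain socket (placeholder host):

    import socket

    host = "example.com"
    request = ("GET / HTTP/1.0\r\n"
               "Host: %s\r\n"
               "Connection: close\r\n\r\n" % host)

    sock = socket.create_connection((host, 80), timeout=30)
    sock.sendall(request.encode("ascii"))

    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        chunks.append(chunk)
    sock.close()

    # Split the raw response into headers and body.
    headers, _, body = b"".join(chunks).partition(b"\r\n\r\n")
    print(body[:200])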

Scraping and web crawling framework, PHP

Hi, I know scrapy.org, a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. I used it in some projects and it is very simple to use. But it is written in Python. My question is: are there similar frameworks for PHP? ...

How can I scrape a website with invalid HTML

I'm trying to scrape data from a website that has invalid HTML. Simple HTML DOM Parser parses it but loses some info because of how it handles the invalid HTML. The built-in DOM parser with DOMXPath isn't working; it returns a blank result set. I was able to get it (DOMDocument and DOMXPath) working locally after running the fetched...
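
Parsers differ widely in how they repair broken markup. In PHP, wrapping DOMDocument::loadHTML with libxml_use_internal_errors(true) often recovers where the defaults return nothing; the Python equivalent sketched below uses the third-party lxml package, whose HTML parser recovers from invalid markup instead of failing:

    import lxml.html  # third-party: pip install lxml

    broken = "<div><p>unclosed paragraph<div>nested wrong</p></div>"
    doc = lxml.html.fromstring(broken)  # repairs instead of erroring out
    for node in doc.xpath("//div"):
        print(node.text_content())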

Emulate javascript __doPostBack in Python, web scraping

Here LINK it is suggested that it is possible to "figure out what the JavaScript is doing and emulate it in your Python code". This is what I would like help doing, i.e. my question: how do I emulate javascript:__doPostBack? Code from a website (full page source here LINK): <a style="color: Black;" href="javascript:__doPostBack('ctl00$Co...
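
__doPostBack just submits the page's form with two hidden fields filled in, so emulating it means collecting the page's hidden inputs (including __VIEWSTATE and, on many pages, __EVENTVALIDATION), setting __EVENTTARGET to the control name from the link, and POSTing everything back to the same URL. A hedged Python sketch, with the field scraping simplified to a regex that assumes the usual ASP.NET attribute order:

    import re
    import urllib.parse
    import urllib.request

    url = "http://example.com/page.aspx"  # placeholder target
    html = urllib.request.urlopen(url).read().decode("utf-8", "replace")

    # Grab every hidden input (__VIEWSTATE, __EVENTVALIDATION, ...);
    # fragile by design, a real HTML parser is sturdier.
    fields = dict(re.findall(
        r'<input type="hidden" name="([^"]+)"[^>]*value="([^"]*)"', html))

    # These are what __doPostBack('target', 'argument') would set.
    fields["__EVENTTARGET"] = "ctl00$Co..."  # control name from the href
    fields["__EVENTARGUMENT"] = ""

    data = urllib.parse.urlencode(fields).encode("ascii")
    postback = urllib.request.urlopen(url, data).read()
    print(len(postback), "bytes")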

Need to extract results from Google keywords external tool?

Hi, I need to build a little Java tool that gets the keyword suggestions and traffic estimates from the Google keywords tool at https://adwords.google.com/select/KeywordToolExternal . The page is rendered in JavaScript, so simple scraping isn't possible. I have tried HtmlUnit, but it doesn't work (tried different browser versions... still no lu...

How to perform web scraping to find specific linked pages in Java on Google App Engine?

I need to retrieve text from a remote web site that does not provide an RSS feed. What I know is that the data I need is always on pages linked to from the main page (http://www.example.com/) with a link that contains the text " Invoices Report ". For example: <a href="http://www.example.com/data/invoices/2010/10/invoices-report---tue...
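
Given that constraint, one approach is to fetch the main page, scan its anchors, and keep only those whose link text contains "Invoices Report". A standard-library Python sketch of the scanning step (on App Engine the fetch itself would go through the urlfetch service, which this sketch does not model):

    from html.parser import HTMLParser
    from urllib.request import urlopen

    class AnchorTextFinder(HTMLParser):
        """Record hrefs of <a> tags whose text contains a needle."""
        def __init__(self, needle):
            super().__init__()
            self.needle, self.href, self.matches = needle, None, []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.href = dict(attrs).get("href")

        def handle_data(self, data):
            if self.href and self.needle in data:
                self.matches.append(self.href)

        def handle_endtag(self, tag):
            if tag == "a":
                self.href = None

    html = urlopen("http://www.example.com/").read().decode("utf-8", "replace")
    finder = AnchorTextFinder("Invoices Report")
    finder.feed(html)
    print(finder.matches)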

problems in a ruby screen-scraping script

Hi! I have a small crawler/screen-scraping script that used to work half a year ago, but now it doesn't work anymore. I checked the HTML and CSS values for the regular expression against the page source, and they are still the same, so from this point of view it should work. Any guesses?

    require "open-uri"
    # output file
    f = open 'results.csv'...

Which language to choose for getting data from specified page?

Hi, which language is most suitable for working with cookies and inner page structure with AJAX? Thanks ...