screen-scraping

How to know if the website being scraped has changed?

I'm using PHP to scrape a website and collect some data. It's all done without using regex. I'm using PHP's explode() function to find particular HTML tags instead. If the structure of the website changes (CSS, HTML), the scraper may collect the wrong data. So the question is - how do I know if the HTML struct...

Why is python decode replacing more than the invalid bytes from an encoded string?

Trying to decode an invalidly encoded UTF-8 HTML page gives different results in Python, Firefox and Chrome. The invalidly encoded fragment from the test page looks like 'PREFIX\xe3\xabSUFFIX' >>> fragment = 'PREFIX\xe3\xabSUFFIX' >>> fragment.decode('utf-8', 'strict') ... UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-8: in...

How to scrape a _private_ google group?

Hi there, I'd like to scrape the discussion list of a private Google group. It's a multi-page list and I might have to do this again later, so scripting sounds like the way to go. Since this is a private group, I need to log in to my Google account first. Unfortunately I can't manage to log in using wget or Ruby's Net::HTTP. Surprisingly googl...

How to handle redirects while parsing HTML? - Python

Hi folks, I'm trying to submit a few forms through a Python script, using the mechanize library. This is so I can implement a temporary API. The problem is that after submission a blank page is returned saying that the request is being processed; after a few seconds the page is redirected to the final page. I underst...

Nokogiri and Special Characters

I'm using Nokogiri to grab the contents of the title tag on a webpage, but am having trouble with accented characters. What's the best way to deal with these? Here's what I'm doing: require 'open-uri' require 'nokogiri' doc = Nokogiri::HTML(open(link)) title = doc.at_css("title") At this point, the title looks like this: Rag\30...

Is screen-scraping a Windows application with Ruby possible?

I want to scrape text data from a Windows application to do additional processing using existing Ruby code. Would it be possible to scrape the data as it is updated in the Windows application using Ruby, and where do I start? ...

Use openinviter with Rails or Java?

Hi, does anyone know if I can connect to my self-hosted OpenInviter from within a Ruby on Rails or Java app, e.g. through an API? I couldn't find anything in the docs there and the forum isn't very active. It seems to be a good alternative to Octazen, who have recently been bought by Facebook and won't update their libs anymore. ...

How do I automate navigation to a website that requires authentication?

Here's what I'm trying to achieve. I would like to write a script that will navigate to a website that requires me to be authenticated as myself, say Facebook, Live Spaces, Twitter or any other, and then have that script search for certain information on one of the pages of the website. I've done something similar in the past with the W...

Get Mechanize to handle cookies from an arbitrary POST (to log into a website programmatically)

I want to log into https://www.t-mobile.com/ programmatically. My first idea was to use Mechanize to submit the login form. However, it turns out that this isn't even a real form. Instead, when you click "Log in" some JavaScript grabs the values of the fields, creates a new form dynamically, and submits it. "Log in" button HTML: <bu...

Scraping a page from a secure URL which is possibly using a session ID

How do I scrape a page like this? https://www.procom.ca/JobList.aspx?keywords=&Cities=&reference=&JobType=0 It is secure, and requires a referrer? I can't get anything using wget or httplib2. If you go through this page, you get a list and it works in a browser but not from the command line. https://www.procom.ca/jobsearch.aspx ...

Scrape HTML tables from a given URL into CSV

I seek a tool that can be run on the command line like so: tablescrape 'http://someURL.foo.com' [n] If n is not specified and there's more than one HTML table on the page, it should summarize them (header row, total number of rows) in a numbered list. If n is specified or if there's only one table, it should parse the table and spit i...

Scrapy issue with iTunes' AppStore

I am using Scrapy to fetch some data from iTunes' AppStore database. I start with this list of apps: http://itunes.apple.com/us/genre/mobile-software-applications/id36?mt=8 In the following code I have used the simplest regex which targets all apps in the US store. from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from ...

Is there something in C# that lets me keep a .XML file (with its content and everything) in memory, then save the document to disc as a regular XML file?

I'm going to be doing some webscraping and my plan is to have something like this: public class Searcher { public void Search(string searchTerm) { } private void Search(string term) { //Some HTMLAgilityPack Voodoo here } private void SaveResults() { //Actually save the results as .XML f...

How to get InnerText of IFrame from another site?

I am trying to do some screen-scraping of a website. The content that I want to get is inside of an IFrame. How do I get the InnerText or HTML that is being displayed inside of the IFrame? I am using .NET 4.0 and C#. I want to be able to do this from a WinForm. I tried this, but can't find where to get the actual data from... ...

How to grab dynamic content on website and save it?

Hello! For example, I need to grab from http://gmail.com/ the amount of free storage: Over <span id=quota>2757.272164</span> megabytes (and counting) of free storage. And then store those numbers in a MySQL database. The number, as you can see, is dynamically changing. Is there a way I can set up a server-side script that will be grabb...

How can I use Perl to grab text from a web page that is dynamically generated with JavaScript?

There is a website I am trying to pull information from in Perl; however, the section of the page I need is generated using JavaScript, so all you see in the source is: <div id="results"></div> I need to somehow pull out the contents of that div and save it to a file using Perl/proxies/whatever. e.g. the information I want to save...

How To Parse A Website?

Hey, I would like to build an app that could parse a website in order to get specific information. Specifically, something that can parse http://www.fedex.com/Tracking?language=english&cntry_code=us&tracknumbers=681780934297262 for the important information. Is there a tutorial out there I could use? ...

Extract anything that looks like links from large amount of data in python

Hi, I have around 5 GB of HTML data which I want to process to find links to a set of websites and perform some additional filtering. Right now I use a simple regexp for each site and iterate over them, searching for matches. In my case links can be outside of "a" tags and may not be well formed in many ways (like "\n" in the middle of a link) so...

Scrape all of a user's tweets

I'd like to pull all of a user's tweets. I could do this the hard way (manually scraping Twitter) or the easy way: using their API. The problem with the easy (API) way is that I seem to be limited to the 200 most recent tweets. What's a simple way to get all tweets? Thanks ...

Nokogiri Doc Element Not Returning Correctly

I am trying to scrape a wiktionary entry: uri = URI.parse("http://en.wiktionary.org/wiki/" + CGI.escape('abjure')) doc = Nokogiri::HTML(open(uri, 'User-Agent' => 'ruby')) but the doc shows no elements for this word. The other words work fine and this word used to work. I have no idea what changed. Anyone see anything wrong with thi...