I'm using PHP to scrape a website and collect some data. It's all done without regex: I use PHP's explode() function to find particular HTML tags instead.
If the structure of the website (CSS, HTML) changes, the scraper may collect the wrong data. So the question is: how do I know if the HTML struct...
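One possible way to notice a layout change, sketched here in Python rather than PHP, is to fingerprint the set of tag paths in a page and compare it with a fingerprint saved while the scraper was known to work. The names `TagPathCollector` and `structure_fingerprint` are hypothetical, and the sketch assumes that a change in tag nesting is a good enough signal; it ignores attributes, CSS classes, and text.

```python
from html.parser import HTMLParser
import hashlib


class TagPathCollector(HTMLParser):
    """Collects the set of tag paths (e.g. 'html/body/div') seen in a page."""

    # Void elements never get a closing tag, so they must not stay on the stack.
    VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
            "link", "meta", "source", "track", "wbr"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate sloppy markup.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def structure_fingerprint(html):
    """Return a hash that changes when the set of tag paths changes."""
    collector = TagPathCollector()
    collector.feed(html)
    joined = "\n".join(sorted(collector.paths))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```

Storing the fingerprint next to the scraped data and comparing it on each run would flag a structural change before wrong data is saved. Pages that vary their markup from request to request would need a looser comparison than an exact hash.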
Decoding an invalidly encoded UTF-8 HTML page gives different results in Python, Firefox and Chrome.
The invalidly encoded fragment from the test page looks like 'PREFIX\xe3\xabSUFFIX'
>>> fragment = 'PREFIX\xe3\xabSUFFIX'
>>> fragment.decode('utf-8', 'strict')
...
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-8: in...
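For comparison, here is a minimal sketch of how the standard error handlers treat the same fragment. It assumes Python 3, where the fragment must be a bytes literal, so the exact error position can differ from the Python 2 traceback above. The helper `decode_with` is a hypothetical name.

```python
# The fragment from the question, written as a Python 3 bytes literal.
fragment = b'PREFIX\xe3\xabSUFFIX'


def decode_with(handler):
    """Decode the fragment as UTF-8 using the named error handler."""
    return fragment.decode('utf-8', handler)


# 'strict' raises on the truncated three-byte sequence \xe3\xab.
try:
    decode_with('strict')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = decode_with('ignore')    # invalid bytes are dropped
replaced = decode_with('replace')  # invalid bytes become U+FFFD

print(strict_failed, repr(ignored), repr(replaced))
```

How many replacement characters a decoder emits for one bad sequence, and whether it consumes the byte that follows, is one place where decoders can differ.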
Hi there,
I'd like to scrape the discussion list of a private Google group. It's a multi-page list and I might have to do this again later, so scripting sounds like the way to go.
Since this is a private group, I need to log in to my Google account first.
Unfortunately I can't manage to log in using wget or Ruby's Net::HTTP. Surprisingly googl...
Hi folks,
I'm trying to submit a few forms through a Python script using the mechanize library.
This is so I can implement a temporary API.
The problem is that after submission a blank page is returned saying that the request is being processed; after a few seconds the page redirects to the final page.
I underst...
I'm using Nokogiri to grab the contents of the title tag on a webpage, but am having trouble with accented characters. What's the best way to deal with these? Here's what I'm doing:
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open(link))
title = doc.at_css("title")
At this point, the title looks like this:
Rag\30...
I want to scrape text data from a Windows application for additional processing with existing Ruby code. Is it possible to scrape the data with Ruby as it is updated in the Windows application, and where do I start?
...
Hi,
Does anyone know if I can connect to my self-hosted OpenInviter from within a Ruby on Rails or Java app, e.g. through an API? I couldn't find anything in the docs, and the forum isn't very active. It seems to be a good alternative to Octazen, who were recently bought by Facebook and won't update their libs anymore.
...
Here's what I'm trying to achieve. I would like to write a script that will navigate to a website that requires me to be authenticated as myself, say Facebook, Live Spaces, Twitter or any other, and then have that script search for certain information on one of the pages of the website.
I've done something similar in the past with the W...
I want to log into https://www.t-mobile.com/ programmatically. My first idea was to use Mechanize to submit the login form:
However, it turns out that this isn't even a real form. Instead, when you click "Log in", some JavaScript grabs the values of the fields, creates a new form dynamically, and submits it.
"Log in" button HTML:
<bu...
How do I scrape a page like this?
https://www.procom.ca/JobList.aspx?keywords=&Cities=&reference=&JobType=0
It is secure, and seems to require a referrer? I can't get anything using wget or httplib2.
If you go through this page, you get a list, and it works in a browser but not from the command line.
https://www.procom.ca/jobsearch.aspx
...
I seek a tool that can be run on the command line like so:
tablescrape 'http://someURL.foo.com' [n]
If n is not specified and there's more than one HTML table on the page, it should summarize them (header row, total number of rows) in a numbered list.
If n is specified or if there's only one table, it should parse the table and spit i...
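A rough sketch of the parsing core of such a tool, using only Python's standard library. The names `TableCollector`, `parse_tables`, `summarize`, and `to_csv` are hypothetical. CSV is an assumed output format, since the excerpt is cut off before it names one. The sketch also assumes rows and cells are explicitly closed, and it folds the text of nested tables into the enclosing cell.

```python
import csv
import io
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collects every top-level <table> as a list of rows of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per top-level table
        self.depth = 0     # nesting depth of <table> elements
        self.row = None    # cells of the row being read
        self.cell = None   # text chunks of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
            if self.depth == 1:
                self.tables.append([])
        elif self.depth == 1 and tag == "tr":
            self.row = []
        elif self.depth == 1 and tag in ("td", "th") and self.row is not None:
            self.cell = []

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1
        elif self.depth == 1 and tag in ("td", "th"):
            if self.cell is not None and self.row is not None:
                # Collapse runs of whitespace inside the cell text.
                self.row.append(" ".join("".join(self.cell).split()))
            self.cell = None
        elif self.depth == 1 and tag == "tr":
            if self.row is not None:
                self.tables[-1].append(self.row)
            self.row = None
            self.cell = None


def parse_tables(html):
    """Return all top-level tables found in the HTML string."""
    collector = TableCollector()
    collector.feed(html)
    return collector.tables


def summarize(tables):
    """One line per table: index, first row, and total number of rows."""
    lines = []
    for i, rows in enumerate(tables, start=1):
        header = rows[0] if rows else []
        lines.append(f"{i}. header={header} rows={len(rows)}")
    return lines


def to_csv(rows):
    """Render one table as CSV text (an assumed output format)."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()
```

A command-line wrapper would fetch the URL, call `parse_tables`, and then print either `summarize(tables)` or `to_csv(tables[n - 1])` depending on whether n was given.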
I am using Scrapy to fetch some data from iTunes' AppStore database. I start with this list of apps: http://itunes.apple.com/us/genre/mobile-software-applications/id36?mt=8
In the following code I have used the simplest regex which targets all apps in the US store.
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from ...
I'm going to be doing some webscraping and my plan is to have something like this:
public class Searcher
{
    public void Search(string searchTerm)
    {
    }
    private void Search(string term)
    {
        //Some HTMLAgilityPack Voodoo here
    }
    private void SaveResults()
    {
        //Actually save the results as .XML f...
I am trying to do some screen-scraping of a website. The content that I want to get is inside of an IFrame. How do I get the InnerText or HTML that is being displayed inside of the IFrame?
I am using .Net 4.0 and C#. I want to be able to do this from a WinForm.
I tried this, but can't find where to get the actual data from...
...
Hello!
For example, I need to grab the amount of free storage from http://gmail.com/:
Over <span id=quota>2757.272164</span> megabytes (and counting) of free storage.
And then store those numbers in a MySQL database.
The number, as you can see, changes dynamically.
Is there a way I can set up a server side script that will be grabb...
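As an illustration of the extraction step only, here is a small Python sketch that pulls the number out of markup shaped like the line quoted above. The function name `extract_quota` is hypothetical. It assumes the value is present in the HTML the server returns; if the page fills it in with script after loading, a plain fetch would not see it. Fetching the page and writing to the database are left out.

```python
import re

# Matches <span id=quota>NUMBER</span>, with or without quotes around the id.
QUOTA_RE = re.compile(
    r"""<span\s+id=["']?quota["']?\s*>\s*([0-9]+(?:\.[0-9]+)?)\s*</span>""",
    re.IGNORECASE,
)


def extract_quota(html):
    """Return the quota in megabytes as a float, or None if it is absent."""
    match = QUOTA_RE.search(html)
    return float(match.group(1)) if match else None


sample = ("Over <span id=quota>2757.272164</span> megabytes "
          "(and counting) of free storage.")
print(extract_quota(sample))
```

Returning None rather than raising lets a scheduled job skip a run cleanly when the markup is missing or has changed.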
There is a website I am trying to pull information from in Perl, but the section of the page I need is generated using JavaScript, so all you see in the source is:
<div id="results"></div>
I need to somehow pull out the contents of that div and save it to a file using Perl/proxies/whatever. e.g. the information I want to save...
Hey, I would like to build an app that can parse a website to get specific information. Specifically, something that can parse http://www.fedex.com/Tracking?language=english&cntry_code=us&tracknumbers=681780934297262 for the important information. Is there a tutorial out there I could use?
...
Hi, I have around 5 GB of HTML data which I want to process to find links to a set of websites and perform some additional filtering. Right now I use a simple regexp for each site and iterate over them, searching for matches. In my case links can be outside of "a" tags and be malformed in many ways (like "\n" in the middle of a link) so...
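One alternative to running a separate regexp per site is to compile all the sites into a single alternation and scan the text once. The sketch below shows that idea in Python; `build_matcher` and `find_links` are hypothetical names, and the sketch does not handle links broken by a newline in the middle, which would need a normalization pass first.

```python
import re


def build_matcher(domains):
    """Compile one pattern that matches a link to any of the given domains."""
    # Longest names first so a longer domain wins over a shorter one it contains.
    alternation = "|".join(
        re.escape(d) for d in sorted(domains, key=len, reverse=True)
    )
    return re.compile(
        r"(?<![\w.-])(?:https?://)?(?:www\.)?(" + alternation + r")(?![\w-])"
        r'''(/[^\s"'<>]*)?''',
        re.IGNORECASE,
    )


def find_links(text, matcher):
    """Return (domain, matched link) pairs in order of appearance."""
    return [(m.group(1).lower(), m.group(0)) for m in matcher.finditer(text)]


matcher = build_matcher(["example.com", "foo.org"])
text = ('see http://example.com/a?b=1 and '
        '<a href="https://www.foo.org/x">y</a> but not notexample.com')
print(find_links(text, matcher))
```

For several gigabytes of input, the files could be read in chunks or line by line and passed through `find_links`, with the additional filtering applied to the returned pairs.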
I'd like to pull all of a user's tweets. I could do this the hard way (manually scraping Twitter) or the easy way: using their API. The problem with the easy (API) way is that I seem to be limited to the 200 most recent tweets. What's a simple way to get all tweets?
Thanks
...
I am trying to scrape a wiktionary entry:
uri = URI.parse("http://en.wiktionary.org/wiki/" + CGI.escape('abjure'))
doc = Nokogiri::HTML(open(uri, 'User-Agent' => 'ruby'))
but the doc shows no elements for this word. The other words work fine and this word used to work. I have no idea what changed. Anyone see anything wrong with thi...