Hello! For example, I need to grab the amount of free storage from http://gmail.com/:

Over <span id=quota>2757.272164</span> megabytes (and counting) of free storage.

And then store those numbers in a MySQL database. The number, as you can see, changes dynamically.

Is there a way I can set up a server-side script that will grab that number every time it changes and save it to the database?

Thanks.

+3  A: 

Since Gmail doesn't provide any API to get this information, it sounds like you want to do some web scraping.

Web scraping (also called Web harvesting or Web data extraction) is a computer software technique of extracting information from websites

There are numerous ways of doing this, as mentioned in the wikipedia article linked before:

Human copy-and-paste: Sometimes even the best Web-scraping technology can not replace human’s manual examination and copy-and-paste, and sometimes this may be the only workable solution when the websites for scraping explicitly setup barriers to prevent machine automation.

Text grepping and regular expression matching: A simple yet powerful approach to extract information from Web pages can be based on the UNIX grep command or regular expression matching facilities of programming languages (for instance Perl or Python).

HTTP programming: Static and dynamic Web pages can be retrieved by posting HTTP requests to the remote Web server using socket programming.

DOM parsing: By embedding a full-fledged Web browser, such as the Internet Explorer or the Mozilla Web browser control, programs can retrieve the dynamic contents generated by client side scripts. These Web browser controls also parse Web pages into a DOM tree, based on which programs can retrieve parts of the Web pages.

HTML parsers: Some semi-structured data query languages, such as the XML query language (XQL) and the hyper-text query language (HTQL), can be used to parse HTML pages and to retrieve and transform Web content.

Web-scraping software: There are many Web-scraping software available that can be used to customize Web-scraping solutions. These software may provide a Web recording interface that removes the necessity to manually write Web-scraping codes, or some scripting functions that can be used to extract and transform Web content, and database interfaces that can store the scraped data in local databases.

Semantic annotation recognizing: The Web pages may embrace metadata or semantic markups/annotations which can be made use of to locate specific data snippets. If the annotations are embedded in the pages, as Microformat does, this technique can be viewed as a special case of DOM parsing. In another case, the annotations, organized into a semantic layer, are stored and managed separately from the Web pages, so the Web scrapers can retrieve data schema and instructions from this layer before scraping the pages.
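To make the "text grepping and regular expression matching" option from the list above concrete, here is a minimal Python sketch. The sample string is the fragment quoted in the question; the pattern's tolerance for unquoted/quoted attribute values is an assumption about how the page is marked up:

```python
import re

# The fragment quoted in the question; note the id attribute is unquoted.
html = 'Over <span id=quota>2757.272164</span> megabytes (and counting) of free storage.'

# Allow the id value to appear unquoted, single-quoted, or double-quoted.
match = re.search(r'<span id=["\']?quota["\']?>([\d.]+)</span>', html)
megabytes = float(match.group(1)) if match else None
```

On the real page you would feed the full downloaded HTML to the same search. A regex like this is brittle against markup changes, which is why the DOM-parsing options listed above exist.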

And before I continue, please keep in mind the legal implications of all this. I don't know if it's compliant with Gmail's terms of service, and I would recommend checking them before moving forward. You might also end up being blacklisted or run into other issues like that.

All that being said, I'd say that in your case you need some kind of spider and DOM parser to log into gmail and find the data you want. The choice of this tool will depend on your technology stack.

As a Ruby dev, I like using Mechanize and Nokogiri. Using PHP, you could take a look at solutions like Sphider.

marcgg
A: 

One way I can see you doing this (which may not be the most efficient way) is to use PHP and YQL (From Yahoo!). With YQL, you can specify the webpage (www.gmail.com) and the XPATH to get you the value inside the span tag. It's essentially web-scraping but YQL provides you with a nice way to do it using maybe 4-5 lines of code.

You can wrap this whole thing inside a function that gets called every x seconds, or whatever time period you are looking for.
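YQL itself is a hosted Yahoo! service, but the XPath-selection idea it relies on can be sketched with Python's standard library as a dependency-free stand-in. The fragment below is hypothetical well-formed markup; the real page is HTML, so in practice you would use an HTML-tolerant parser (e.g. lxml.html) or let YQL apply the XPath server-side:

```python
import xml.etree.ElementTree as ET

# A well-formed stand-in for the relevant part of the page.
fragment = '<p>Over <span id="quota">2757.272164</span> megabytes</p>'

root = ET.fromstring(fragment)
# The same kind of XPath expression you would hand to YQL:
# select the span element whose id attribute equals "quota".
quota_span = root.find('.//span[@id="quota"]')
value = float(quota_span.text)
```

The appeal of the XPath route over raw regexes is that the selection is anchored to document structure, so whitespace or attribute reordering in the page doesn't break it.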

Tilo Mitra
This wouldn't really work, since you need to authenticate before accessing the data the OP is looking for
marcgg
No you wouldn't: the data he is looking for is on the main Gmail login screen, www.gmail.com
Tilo Mitra
How can I call a function every second? A cron job? As far as I know, cron can run a job at most once a minute.
Docstero
Cron could be one solution. AJAX would be another: you could call the PHP script via AJAX from a page, using JavaScript's setInterval method.
Tilo Mitra
Yes, but then you have to keep that web page open in a browser. I was hoping to find an automated server-side solution.
Docstero
Well, I suppose you could run a PHP script through a cron job that does the following: while true; do run_yql_query; sleep 2; done (where run_yql_query performs the YQL query). This way, you are pausing for 2 seconds and then repeating your YQL query. Having said that, take a look at what newtover is saying below. You could try to plot the points of that Gmail storage value, get a rough slope, and just use that to increment your own counter.
Tilo Mitra
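The slope idea from the last comment amounts to simple linear extrapolation: sample the counter twice, compute its growth rate, and then increment your own copy locally instead of re-scraping. A minimal sketch (the function name and sample values are illustrative, not from the page):

```python
def estimate_quota(t0, q0, t1, q1, now):
    """Extrapolate the counter from two observed (unix-time, megabytes) samples.

    If the page's counter grows at a constant rate, a straight line through
    two samples predicts its value at any later time without re-scraping.
    """
    rate = (q1 - q0) / (t1 - t0)  # megabytes per second
    return q0 + rate * (now - t0)

# Two hypothetical samples 4 seconds apart (0.5 MB/s), extrapolated to t=10.
estimate = estimate_quota(0, 100.0, 4, 102.0, 10)  # -> 105.0
```

You would still re-sample occasionally to correct drift, but this cuts the scraping frequency from every couple of seconds to, say, once an hour.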
A: 

Leaving aside the legality issues in this particular case, I would suggest the following:

When trying to attack something impossible, stop and think about where the impossibility comes from and whether you chose the correct approach.

Do you really think that someone in their right mind would issue a new HTTP connection, or even worse hold an open Comet connection, just to check whether the common storage counter has grown? For an anonymous user? Just look for a function that computes a value based on some initial value and the current time.

newtover
This is just an example, to explain what I need to do. The real project has nothing to do with gmail. I just wanted to explain that I need to grab dynamic data from website and store it.
Docstero
@Docstero: dynamic data does not come from nowhere. It is usually the deterministic product of data received from the server. It is simpler to talk directly to the web services used by the client code (I even managed to do this in place of a Flex application using pyamf). Otherwise, your application would have to incorporate a full-fledged browser or be a browser plugin (as Firebug is).
newtover
+1  A: 

Initially I thought it was not possible, assuming the number was initialized by JavaScript.

But if you switch off JavaScript, the number is still there in the span tag; a JavaScript function probably just increments it at a regular interval.

So you can use curl, fopen, etc. to read the contents of the URL, then parse the contents for this value and store it in the database. Then set this up as a cron job to run on a regular basis.
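This fetch-parse-store pipeline can be sketched in Python. Here sqlite3 stands in for MySQL so the sketch stays dependency-free; with MySQL you would swap in a driver and keep the same SQL. The regex and table layout are assumptions for illustration:

```python
import re
import sqlite3
import urllib.request

QUOTA_RE = re.compile(r'<span id=["\']?quota["\']?>([\d.]+)</span>')

def fetch_page(url):
    # The curl/fopen step: download the raw HTML of the page.
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode('utf-8', errors='replace')

def parse_quota(html):
    # The parsing step: pull the number out of the quota span.
    match = QUOTA_RE.search(html)
    return float(match.group(1)) if match else None

def store_quota(conn, megabytes):
    # The storage step: one timestamped row per observation.
    conn.execute('CREATE TABLE IF NOT EXISTS quota ('
                 'ts DATETIME DEFAULT CURRENT_TIMESTAMP, megabytes REAL)')
    conn.execute('INSERT INTO quota (megabytes) VALUES (?)', (megabytes,))
    conn.commit()

# Wire the steps together (a canned sample is parsed here instead of a
# live fetch; a cron entry would run this script periodically).
sample = 'Over <span id=quota>2757.272164</span> megabytes of free storage.'
conn = sqlite3.connect(':memory:')
store_quota(conn, parse_quota(sample))
```

A crontab line running the script every minute is as frequent as cron allows; for tighter sampling you would loop with a sleep inside the script itself, as discussed in the comments above.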

There are many references on how to do this, including on SO. If you get stuck, just open another question.

Warning: Google has ways of finding out if its apps are being scraped, and it will block your IP for a certain period of time. Read the Google small print. It's happened to me.

zaf