I want to write a program in C/C++ that will dynamically read a web page and extract information from it. As an example, imagine you wanted to write an application to follow and log an eBay auction. Is there an easy way to grab the web page? A library which provides this functionality? And is there an easy way to parse the page to get at the specific data?

Regards

+1  A: 

What you are looking for is called HTML screen scraping. A search for "c++" and "scraping" should turn up a lot of links.

Will Dieterich
+17  A: 

Have a look at the cURL library:

 #include <stdio.h>
 #include <curl/curl.h>

 int main(void)
 {
   CURL *curl;
   CURLcode res;

   curl = curl_easy_init();
   if(curl) {
     curl_easy_setopt(curl, CURLOPT_URL, "curl.haxx.se");
     res = curl_easy_perform(curl);
     /* always cleanup */
     curl_easy_cleanup(curl);
   }
   return 0;
 }
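
That example just streams the page to stdout. If you want the contents in a buffer you can parse afterwards, a write callback along these lines should do it (a quick sketch; CURLOPT_WRITEFUNCTION and CURLOPT_WRITEDATA are standard libcurl options, but treat the code as untested):

 #include <string>
 #include <iostream>
 #include <curl/curl.h>

 /* libcurl calls this for every chunk of the response body */
 static size_t write_cb(char *data, size_t size, size_t nmemb, void *userp)
 {
   std::string *body = static_cast<std::string *>(userp);
   body->append(data, size * nmemb);
   return size * nmemb;
 }

 int main(void)
 {
   CURL *curl = curl_easy_init();
   if(!curl)
     return 1;

   std::string body;
   curl_easy_setopt(curl, CURLOPT_URL, "http://curl.haxx.se");
   curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
   curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);

   CURLcode res = curl_easy_perform(curl);
   if(res == CURLE_OK)
     std::cout << body;                /* the whole page, ready to parse */
   else
     std::cerr << curl_easy_strerror(res) << std::endl;

   curl_easy_cleanup(curl);
   return 0;
 }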

BTW, if C++ is not strictly required, I'd encourage you to try C# or Java. It is much easier there, and both have built-in ways to fetch a page over HTTP.

m3rLinEz
+1 for cURL - I've used cURL in one of my C++ applications and it works great, even with proxies and all other obstacles you might encounter.
BlaM
It's good to advise using the right tool for the job!
xtofl
It would be better to return an error if curl is null (in above example).
Matthew Flaschen
Check out curlpp, a C++ wrapper for the cURL library
Piotr Dobrogost
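For reference, a minimal curlpp sketch (assuming a recent curlpp; it fetches a page and streams the body to stdout):

 #include <iostream>
 #include <curlpp/cURLpp.hpp>
 #include <curlpp/Easy.hpp>
 #include <curlpp/Options.hpp>
 #include <curlpp/Exception.hpp>

 int main()
 {
   try {
     curlpp::Cleanup cleanup;   // RAII init/cleanup of libcurl
     curlpp::Easy request;
     request.setOpt(curlpp::options::Url("http://curl.haxx.se"));
     std::cout << request;      // perform the request, body goes to stdout
   }
   catch(curlpp::RuntimeError &e) {
     std::cerr << e.what() << std::endl;
   }
   catch(curlpp::LogicError &e) {
     std::cerr << e.what() << std::endl;
   }
   return 0;
 }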
+1  A: 

You can do it with socket programming, but it's tricky to implement the parts of the protocol needed to reliably fetch a page. Better to use a library, like neon. This is likely to be installed in most Linux distributions. Under FreeBSD use the fetch library.

For parsing the data, because many pages don't use valid XML, you need to implement heuristics rather than a real yacc-based parser. You can implement these using regular expressions or a state transition machine. As what you're trying to do involves a lot of trial and error, you're better off using a scripting language like Perl. Due to the high network latency you will not see any difference in performance.
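
If you do stay in C++, the heuristic can be as simple as a regular expression. A sketch (the markup and the id are invented; a real auction page will be messier):

 #include <iostream>
 #include <regex>
 #include <string>

 int main()
 {
   // Pretend this came from the fetched page; the real markup will differ.
   std::string html = "<span id=\"currentPrice\">US $12.50</span>";

   // A heuristic, not a real parser: grab whatever sits inside that span.
   std::regex price_re("<span id=\"currentPrice\">([^<]+)</span>");
   std::smatch m;
   if(std::regex_search(html, m, price_re))
     std::cout << "Current price: " << m[1] << std::endl;
   return 0;
 }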

Diomidis Spinellis
While HTML pages usually aren't valid XML, many languages have HTML parser libraries that let you use a DOM interface to parse an HTML document.
Daniel Papasian
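libxml2's HTML parser is one such library; a rough sketch (assuming libxml2 is installed) that walks the DOM and prints every link:

 #include <stdio.h>
 #include <string.h>
 #include <libxml/HTMLparser.h>

 /* recursively walk the DOM and print the href of every <a> element */
 static void print_links(xmlNode *node)
 {
   for(xmlNode *cur = node; cur; cur = cur->next) {
     if(cur->type == XML_ELEMENT_NODE && xmlStrcmp(cur->name, BAD_CAST "a") == 0) {
       xmlChar *href = xmlGetProp(cur, BAD_CAST "href");
       if(href) {
         printf("%s\n", (const char *)href);
         xmlFree(href);
       }
     }
     print_links(cur->children);
   }
 }

 int main(void)
 {
   const char *html = "<html><body><a href=\"http://example.com\">x</a></body></html>";
   htmlDocPtr doc = htmlReadMemory(html, strlen(html), NULL, NULL,
                                   HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
   if(!doc)
     return 1;
   print_links(xmlDocGetRootElement(doc));
   xmlFreeDoc(doc);
   return 0;
 }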
Yes, neon is nice too (but most of my experience is with curl, as mentioned in m3rLinEz's answer). Any comparison somewhere?
bortzmeyer
A: 

Try using a library like Qt, which can read data over the network and get data out of an XML document, so you could read an XML feed this way - the eBay feed, for example.
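
A rough sketch of that idea using Qt's network and XML classes (this assumes Qt 5, and the feed URL is just a placeholder):

 #include <QCoreApplication>
 #include <QNetworkAccessManager>
 #include <QNetworkRequest>
 #include <QNetworkReply>
 #include <QXmlStreamReader>
 #include <QTextStream>
 #include <QUrl>

 int main(int argc, char *argv[])
 {
   QCoreApplication app(argc, argv);
   QNetworkAccessManager manager;

   // Placeholder URL - substitute the feed you actually want to follow.
   QNetworkReply *reply = manager.get(
       QNetworkRequest(QUrl("http://example.com/feed.xml")));

   QObject::connect(reply, &QNetworkReply::finished, [&]() {
     // Parse the downloaded feed and print every <title> element.
     QXmlStreamReader xml(reply->readAll());
     while(!xml.atEnd()) {
       xml.readNext();
       if(xml.isStartElement() && xml.name() == QLatin1String("title"))
         QTextStream(stdout) << xml.readElementText() << '\n';
     }
     reply->deleteLater();
     app.quit();
   });

   return app.exec();
 }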

Marius
+2  A: 

Are you sure you want to do this in C++? Perl was explicitly made for such things, and other scripting languages should be good too.

Arkadiy
+1  A: 

A similar question has already been posted, but there is not much about C++ there.

Ola Eldøy
A: 

There is a free TCP/IP library available for Windows that supports HTTP and HTTPS - using it is very straightforward.

Ultimate TCP/IP

CUT_HTTPClient http;
http.GET("http://folder/file.htm", "c:/tmp/process_me.htm");

You can also GET files and store them in a memory buffer (via CUT_DataSource derived classes). All the usual HTTP support is there - PUT, HEAD, etc. Support for proxy servers is a breeze, as are secure sockets.

Rob
A: 

You're not mentioning any platform, so I'll give you an answer for Win32.

One simple way to download anything from the Internet is URLDownloadToFile with the IBindStatusCallback parameter set to NULL. To make the function more useful (progress reporting, for instance), the callback interface needs to be implemented.
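
A minimal sketch (the URL and target path are just placeholders; link against urlmon.lib):

 #include <windows.h>
 #include <urlmon.h>
 #include <iostream>

 #pragma comment(lib, "urlmon.lib")

 int main()
 {
   // Download straight to a file; the final NULL means no IBindStatusCallback.
   HRESULT hr = URLDownloadToFile(NULL,
                                  TEXT("http://example.com/page.html"),
                                  TEXT("C:\\temp\\page.html"),
                                  0, NULL);
   if(SUCCEEDED(hr))
     std::cout << "Downloaded." << std::endl;
   else
     std::cerr << "URLDownloadToFile failed, hr=0x" << std::hex << hr << std::endl;
   return 0;
 }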

Johann Gerell
A: 
Stephen Hazel
Such ugly code
Piotr Dobrogost
ugly how? Perhaps you could post YOUR code...
Stephen Hazel