Hi, I need to read all the HTML text from a URL like http://localhost/index.html into a string in C.

I know that if I run telnet www.google.com 80 and then type a GET request for the page, it returns all the HTML.

How do I do this in a Linux environment with C?

+1  A: 

You use sockets, interrogate the web server over HTTP (that is the "http://" part of "http://localhost/index.html"), and then parse the data you receive.

Helpful if you are a beginner in socket programming: http://beej.us/guide/bgnet/

Cristina
Do you have a simple example of how to do this? I'm going to check the web.
Vanilla
@umetzu: It's not a simple task, so a simple example is hard to write.
Johann Gerell
@Cri don't believe I've ever had my webserver interrogated lol
Earlz
+1  A: 

If you really don't feel like messing around with sockets, you could always create a named temp file, fork off a process, have the child execvp() wget -O <tempfile> <url>, and then read the page back from that temp file once the child has exited.

Although this would be a pretty lame and inefficient way to do things, it would mean you wouldn't have to mess with TCP and sending HTTP requests.
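
A rough sketch of that approach, in case it helps. The function names run_and_wait and fetch_with_wget and the /tmp/fetchXXXXXX template are made up for illustration, and it assumes wget is installed and on the PATH. The -q (quiet) and -O (output file) options are standard wget options.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork, run argv[0] with execvp(), and wait for it to finish.
   Returns the child's exit status, or -1 if it could not be run. */
int run_and_wait(char *const argv[])
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);                 /* only reached if execvp() failed */
    }
    int status;
    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}

/* Hypothetical helper: download url into a temp file with wget and return
   the contents as a malloc'd, NUL-terminated string (caller frees).
   Returns NULL on any failure. */
char *fetch_with_wget(const char *url)
{
    char path[] = "/tmp/fetchXXXXXX";   /* template name is an assumption */
    int fd = mkstemp(path);
    if (fd == -1)
        return NULL;
    close(fd);                          /* wget reopens the file by name */

    char *argv[] = { "wget", "-q", "-O", path, (char *)url, NULL };
    char *text = NULL;
    FILE *fp = NULL;
    if (run_and_wait(argv) == 0 && (fp = fopen(path, "rb")) != NULL) {
        if (fseek(fp, 0, SEEK_END) == 0) {
            long size = ftell(fp);
            rewind(fp);
            if (size >= 0 && (text = malloc((size_t)size + 1)) != NULL) {
                size_t got = fread(text, 1, (size_t)size, fp);
                text[got] = '\0';
            }
        }
        fclose(fp);
    }
    unlink(path);                       /* remove the temp file either way */
    return text;
}
```

You would call it as char *html = fetch_with_wget("http://localhost/index.html"); and free(html) when done.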

chi42
Sorry, I didn't understand what you mean.
Vanilla
@umetzu: In .Net terminology, you can skip the sockets if you use pre-built solutions, like the application wget (console app), that can fetch the HTML content and write it to a file that you can read from your app. Launch wget from your app, wait for the process to exit, read the file.
Johann Gerell
Thanks, I'm going to test it.
Vanilla
+3  A: 

Below is a rough outline of code (i.e. not much error checking, and I haven't tried to compile it) to get you started, but use http://www.tenouk.com/cnlinuxsockettutorials.html to learn socket programming. Look up gethostbyname if you need to translate a hostname (like google.com) into an IP address. You may also need to parse the content length out of the HTTP response and then keep calling recv until you've received all the bytes.

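#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>

void getWebpage(char *buffer, int bufsize, char *ipaddress)
{
    int sockfd;
    struct sockaddr_in destAddr;

    if(buffer == NULL || bufsize < 1)
        return;
    buffer[0] = '\0'; // leave an empty string behind if anything fails

    if((sockfd = socket(PF_INET, SOCK_STREAM, 0)) == -1){
        fprintf(stderr, "Error opening client socket\n");
        return; // no socket was created, so there is nothing to close
    }

    memset(&destAddr, 0, sizeof(destAddr));
    destAddr.sin_family = AF_INET;
    destAddr.sin_port = htons(80); // HTTP port is 80
    destAddr.sin_addr.s_addr = inet_addr(ipaddress); // Get int representation of IP

    if(connect(sockfd, (struct sockaddr *)&destAddr, sizeof(destAddr)) == -1){
        fprintf(stderr, "Error with client connecting to server\n");
        close(sockfd);
        return;
    }

    // Send http request. The blank line (\r\n\r\n) marks the end of the
    // request; without it the server keeps waiting and recv() never returns
    const char *httprequest = "GET / HTTP/1.0\r\n\r\n";
    if(send(sockfd, httprequest, strlen(httprequest), 0) == -1){
        fprintf(stderr, "Error sending request\n");
        close(sockfd);
        return;
    }

    // One recv() may return only part of the response, so keep reading until
    // the server closes the connection (recv returns 0) or the buffer is full
    int total = 0;
    ssize_t n;
    while(total < bufsize - 1 &&
          (n = recv(sockfd, buffer + total, bufsize - 1 - total, 0)) > 0)
        total += n;
    buffer[total] = '\0'; // terminate so buffer can be used as a C string
    close(sockfd);

    // Now buffer has the HTTP response which includes the webpage. You can either
    // trim off the HTTP header, or just leave it in depending on what you are doing
    // with the page
}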
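
For trimming off the header and pulling out the content length, something along these lines might do. It is only a sketch: the names http_body and http_content_length are made up here, and it assumes the response in buffer is NUL-terminated and that header lines end in \r\n as HTTP specifies.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>
#include <strings.h>   /* strncasecmp */

/* Return a pointer to the first byte after the blank line that ends the
   HTTP headers, or NULL if the headers are not complete yet. */
const char *http_body(const char *response)
{
    const char *end = strstr(response, "\r\n\r\n");
    return end ? end + 4 : NULL;
}

/* Return the value of the Content-Length header, or -1 if there is none.
   Only lines inside the header block are examined. */
long http_content_length(const char *response)
{
    const char *headers_end = strstr(response, "\r\n\r\n");
    const char *line = response;

    while (line && (!headers_end || line < headers_end)) {
        /* header names are case-insensitive; "Content-Length:" is 15 chars */
        if (strncasecmp(line, "Content-Length:", 15) == 0)
            return strtol(line + 15, NULL, 10);
        line = strstr(line, "\r\n");    /* advance to the next header line */
        if (line)
            line += 2;
    }
    return -1;
}
```

After getWebpage() returns, http_body(buffer) would point at the HTML itself. Not every server sends Content-Length, so treat -1 as "read until the connection closes".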
jwegan
I'm going to try this.
Vanilla
I tried this, but when it hits recv() it stops. I think the problem is the parameters I'm passing for buffer and bufsize.
Vanilla
recv will block waiting for input from the socket file descriptor. Take a look at http://linux.die.net/man/2/recv for more information on how to use recv. Also check the return value of send to make sure the request is getting sent ok.
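
Because recv() blocks until data arrives, a request that is not terminated by a blank line (\r\n\r\n) leaves the server waiting for the rest of it and the client waiting for a reply. Here is a sketch of a fuller exchange, under some assumptions: it uses getaddrinfo rather than gethostbyname to resolve the name, the helper names (build_request, send_all, recv_all, http_get) are invented for this example, and it relies on an HTTP/1.0 server closing the connection when the response is complete.

```c
#define _POSIX_C_SOURCE 200809L
#include <netdb.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Write a complete HTTP/1.0 request into out. Returns its length,
   or -1 if it does not fit. */
int build_request(char *out, size_t outsize, const char *host, const char *path)
{
    int n = snprintf(out, outsize,
                     "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n", path, host);
    return (n < 0 || (size_t)n >= outsize) ? -1 : n;
}

/* Send all len bytes; send() may accept fewer than requested. */
int send_all(int fd, const char *data, size_t len)
{
    while (len > 0) {
        ssize_t n = send(fd, data, len, 0);
        if (n <= 0)
            return -1;
        data += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read until the peer closes the connection. Returns a malloc'd,
   NUL-terminated buffer (caller frees), or NULL on error. */
char *recv_all(int fd)
{
    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);
    if (!buf)
        return NULL;

    for (;;) {
        if (len + 1 == cap) {           /* full: grow before reading more */
            char *bigger = realloc(buf, cap * 2);
            if (!bigger) { free(buf); return NULL; }
            buf = bigger;
            cap *= 2;
        }
        ssize_t n = recv(fd, buf + len, cap - len - 1, 0);
        if (n < 0) { free(buf); return NULL; }
        if (n == 0)
            break;                      /* peer closed the connection */
        len += (size_t)n;
    }
    buf[len] = '\0';
    return buf;
}

/* Resolve host, connect to port 80, and return the raw response
   (headers plus body), or NULL on failure. */
char *http_get(const char *host, const char *path)
{
    struct addrinfo hints, *res, *ai;
    char request[1024];
    char *response = NULL;
    int reqlen = build_request(request, sizeof request, host, path);
    if (reqlen < 0)
        return NULL;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;        /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, "80", &hints, &res) != 0)
        return NULL;

    /* try each resolved address until one yields a response */
    for (ai = res; ai && !response; ai = ai->ai_next) {
        int fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd == -1)
            continue;
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0 &&
            send_all(fd, request, (size_t)reqlen) == 0)
            response = recv_all(fd);
        close(fd);
    }
    freeaddrinfo(res);
    return response;
}
```

For the URL in the question this would be called as http_get("localhost", "/index.html"), and the result needs free() afterwards.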
jwegan
+6  A: 

I would suggest using a couple of libraries, which are commonly available on most Linux distributions:

libcurl and libxml2

libcurl provides a comprehensive suite of HTTP features, and libxml2 provides a module for parsing HTML, called HTMLParser.

Hope that points you in the right direction

amir75
I was trying to use libcurl, but I didn't know how to compile it, and after I figured that out I was getting too many errors. That's why I want to try with sockets.
Vanilla
Ouch. I just read your comment above about consuming WCF data services from C/Linux. I think your problem is bigger than, or at least different from, what the original question seemed to be asking. Is it a WCF service which is exposed as SOAP? Maybe you should ensure that you've exposed your WCF service as a 'normal' SOAP service (as opposed to any proprietary MS protocol) and then just refer your client to a C SOAP implementation, such as http://sourceforge.net/projects/csoap/
amir75
Hi, I'm using a flavor of WCF called Data Services, which has an open protocol called OData. That's why I think I don't have to use SOAP, just send a request to the web.
Vanilla
A: 

http://curl.haxx.se/

User1
A: 

Assuming you know how to read a file into a string, I'd try

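#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char *read_file_into_string(FILE *fp); // you write this function

const char *url_contents(const char *url) {
  // create w3m command and pass it to popen()
  // note: a single quote inside url would break the shell quoting
  size_t bufsize = strlen(url) + 100;
  char *buf = malloc(bufsize);
  if (buf == NULL) return NULL;
  snprintf(buf, bufsize, "w3m -dump_source '%s'", url);

  // get a file handle, read all the html from it, close, and return
  FILE *html = popen(buf, "r");
  free(buf);
  if (html == NULL) return NULL;
  const char *s = read_file_into_string(html);
  pclose(html); // streams opened with popen() are closed with pclose()
  return s;
}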

You fork a process, but it's a lot easier to let w3m do the heavy lifting.
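
If you haven't written read_file_into_string before, here is one possible sketch. The signature (taking a FILE * and returning a malloc'd string) is an assumption based on how it is called above. Since the stream is a pipe, its size isn't known up front, so the buffer grows as data arrives.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read everything from fp into a malloc'd, NUL-terminated string.
   Works on pipes, where the total size is not known in advance.
   Returns NULL on allocation or read failure; the caller frees. */
char *read_file_into_string(FILE *fp)
{
    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);
    if (!buf)
        return NULL;

    for (;;) {
        /* always leave one byte free for the terminating NUL */
        len += fread(buf + len, 1, cap - len - 1, fp);
        if (len + 1 < cap)              /* short read: EOF or error */
            break;
        char *bigger = realloc(buf, cap * 2);
        if (!bigger) { free(buf); return NULL; }
        buf = bigger;
        cap *= 2;
    }
    if (ferror(fp)) { free(buf); return NULL; }
    buf[len] = '\0';
    return buf;
}
```

The caller owns the returned string and should free() it when finished.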

Norman Ramsey