views:

2412

answers:

8

I'm looking for a library that has functionality similar to Perl's WWW::Mechanize, but for PHP. Basically, it should allow me to submit HTTP GET and POST requests with a simple syntax, and then parse the resulting page and return in a simple format all forms and their fields, along with all links on the page.

I know about CURL, but it's a little too barebones, and the syntax is pretty ugly (tons of curl_foo($curl_handle, ...) statements).

Clarification:

I want something more high-level than the answers so far. For example, in Perl, you could do something like:

# navigate to the main page
$mech->get( 'http://www.somesite.com/' ); 

# follow a link that contains the text 'download this'
$mech->follow_link( text_regex => qr/download this/i );

# submit a POST form, to log into the site
$mech->submit_form(
    with_fields      => {
        username    => 'mungo',
        password    => 'lost-and-alone',
    }
);

# save the results as a file
$mech->save_content('somefile.zip');

To do the same thing using HTTP_Client or wget or CURL would be a lot of work: I'd have to manually parse the pages to find the links, find the form URL, extract all the hidden fields, and so on. The reason I'm asking for a PHP solution is that I have no experience with Perl, and I could probably build what I need with a lot of work, but it would be much quicker if I could do the above in PHP.

A: 

Try looking in the PEAR library. If all else fails, create an object wrapper for curl.

You can do something simple like this:

class curl {
    private $resource;

    public function __construct($url) {
        $this->resource = curl_init($url);
    }

    public function __call($function, array $params) {
        array_unshift($params, $this->resource);
        return call_user_func_array("curl_$function", $params);
    }
}
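
For illustration, a quick (hypothetical) example of how the wrapper might be used; every $c->foo(...) call is forwarded to the matching curl_foo() function with the handle prepended:

// Hypothetical usage of the wrapper above; example.com is a placeholder.
$c = new curl('http://www.example.com/');   // wraps curl_init()
$c->setopt(CURLOPT_RETURNTRANSFER, true);   // forwarded to curl_setopt($handle, ...)
$html = $c->exec();                         // forwarded to curl_exec($handle)
$c->close();                                // forwarded to curl_close($handle)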
orlandu63
This isn't quite what I'm looking for, I added a clarification that hopefully makes it more clear, thanks.
davr
A: 

Try one of the following:

(Yes, it's Zend Framework code, but using it won't slow down your application, since it only loads the required libraries.)

Till
They're still a lot more work than Mechanize, see my clarification to the question.
davr
Good Q. I think none of them do that yet. But I think I'd be up for building it; I am gonna look at the Mechanize API tomorrow.
Till
It will take a while; I still didn't have time to take a deep look at Mechanize. I am surprised there is no port already.
Till
A: 

I know it's a little ghetto, but if you're on a *nix system, use shell_exec() with wget, which has a lot of nice options.
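
For example, a minimal sketch (the URL and output path are placeholders; escapeshellarg() guards against the shell-injection concern raised in the comments below):

// Sketch only: fetch a URL with wget through shell_exec().
// $url and $dest are placeholders; escapeshellarg() prevents shell
// injection if either value ever comes from user input.
$url  = 'http://www.example.com/download.zip';
$dest = '/tmp/somefile.zip';
$cmd  = sprintf('wget -q -O %s %s', escapeshellarg($dest), escapeshellarg($url));
shell_exec($cmd);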

Lucas Oman
maybe a little TOO ghetto. that doesn't really get me anything that I couldn't do with the CURL extension, but it does make it easy for me to introduce shell injection attacks :)
davr
Oh, well yeah, I wouldn't throw user input right in there.
Lucas Oman
+1  A: 

Look into Snoopy: http://sourceforge.net/projects/snoopy/
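
From memory, basic Snoopy usage looks roughly like this (method and property names should be double-checked against the Snoopy docs; the URLs and fields are placeholders):

// Rough sketch of Snoopy usage; verify against the Snoopy documentation.
include 'Snoopy.class.php';

$snoopy = new Snoopy();
$snoopy->fetch('http://www.example.com/');        // GET a page
$html = $snoopy->results;                         // fetched content

$snoopy->fetchlinks('http://www.example.com/');   // extract the links from a page
$links = $snoopy->results;

$snoopy->submit('http://www.example.com/login.php',
                array('username' => 'mungo', 'password' => 'secret'));  // POST a form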

Eli
Looks interesting, but it's pretty old (last update 2005), and while better than curl/wget, it's missing a few features that would make it nicer.
davr
+9  A: 

SimpleTest's ScriptableBrowser can be used independently of the testing framework. I've used it for numerous automation jobs.
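
A rough sketch of what that can look like (the class and method names are from memory of the SimpleTest API, so double-check them against the documentation; the URL and form fields are placeholders):

// Sketch: SimpleTest's scriptable browser used outside the test framework.
require_once 'simpletest/browser.php';

$browser = new SimpleBrowser();
$browser->get('http://www.somesite.com/');   // navigate to a page
$browser->clickLink('download this');        // follow a link by its text
$browser->setField('username', 'mungo');     // fill in form fields
$browser->setField('password', 'lost-and-alone');
$browser->clickSubmit('Login');              // submit the form
$content = $browser->getContent();           // raw content of the resulting page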

troelskn
This looks pretty good, I'll have to give it a try. It has pretty much everything I'd need; the only thing missing is a way to list all the links/forms on a page, but I think I could make do.
davr
You can use $browser->getUrls(). Otherwise, you can always use $dom = DomDocument::loadHtml($browser->getContent()), and then $dom->getElementsByTagName("a"), if you need more control.
troelskn
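
Spelled out, the DOM approach from that comment might look like this (a sketch; $browser is the scriptable browser instance from above):

// Sketch: list all links on the current page via DOMDocument.
$dom = new DOMDocument();
@$dom->loadHTML($browser->getContent());   // @ suppresses warnings from sloppy HTML
foreach ($dom->getElementsByTagName('a') as $anchor) {
    echo $anchor->getAttribute('href') . "\n";
}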
A: 

Curl is the way to go for simple requests. It runs cross-platform, has a PHP extension, and is widely adopted and tested.

I created a nice class that can GET and POST an array of data (INCLUDING FILES!) to a URL by just calling CurlHandler::Get($url, $data) or CurlHandler::Post($url, $data). There's an optional HTTP user authentication option too :)

<?php
/**
 * CURLHandler handles simple HTTP GETs and POSTs via Curl 
 * 
 * @package Pork
 * @author SchizoDuckie
 * @copyright SchizoDuckie 2008
 * @version 1.0
 * @access public
 */
class CURLHandler
{

    /**
     * CURLHandler::Get()
     * 
     * Executes a standard GET request via Curl.
     * Static function, so that you can use: CurlHandler::Get('http://www.google.com');
     * 
     * @param string $url url to get
     * @return string HTML output
     */
    public static function Get($url)
    {
       return self::doRequest('GET', $url);
    }

    /**
     * CURLHandler::Post()
     * 
     * Executes a standard POST request via Curl.
     * Static function, so you can use CurlHandler::Post('http://www.google.com', array('q'=>'StackOverFlow'));
     * If you want to send a file via POST (so it ends up in PHP's $_FILES on the receiving end), prefix the value of an item with an @.
     * @param string $url url to post data to
     * @param Array $vars Array with key=>value pairs to post.
     * @param Array $auth Optional array with 'username' and 'password' keys for HTTP authentication
     * @return string HTML output
     */
    public static function Post($url, $vars, $auth = false) 
    {
       return self::doRequest('POST', $url, $vars, $auth);
    }

    /**
     * CURLHandler::doRequest()
     * This is what actually does the request
     * <pre>
     * - Create Curl handle with curl_init
     * - Set options like CURLOPT_URL, CURLOPT_RETURNTRANSFER and CURLOPT_HEADER
     * - Set any optional options (like CURLOPT_POST and CURLOPT_POSTFIELDS)
     * - Call curl_exec on the interface
     * - Close the connection
     * - Return the result or throw an exception.
     * </pre>
     * @param mixed $method Request method ('GET' or 'POST')
     * @param mixed $url URI to get or post to
     * @param mixed $vars Array of variables (only mandatory in POST requests)
     * @param mixed $auth Optional array with 'username' and 'password' keys for HTTP authentication
     * @return string HTML output
     */
    public static function doRequest($method, $url, $vars=array(), $auth = false)
    {
     $curlInterface = curl_init();

     curl_setopt_array ($curlInterface, array( 
      CURLOPT_URL => $url,
      CURLOPT_RETURNTRANSFER => 1,
      CURLOPT_FOLLOWLOCATION =>1,
      CURLOPT_HEADER => 0));
     if (strtoupper($method) == 'POST')
     {
      curl_setopt_array($curlInterface, array(
       CURLOPT_POST => 1,
       CURLOPT_POSTFIELDS => http_build_query($vars))
      ); 
     }
     if($auth !== false)
     {
        curl_setopt($curlInterface, CURLOPT_USERPWD, $auth['username'] . ":" . $auth['password']);
     }
     $result = curl_exec($curlInterface);

     if ($result === false)
     {
      // Grab the error details before closing the handle; note that
      // curl_exec() returns false on failure, not NULL.
      $error = curl_errno($curlInterface) . " - " . curl_error($curlInterface);
      curl_close($curlInterface);
      throw new Exception('Curl Request Error: ' . $error);
     }

     curl_close($curlInterface);
     return $result;
    }

}

?>
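
Usage then looks like this (the URLs and fields are just placeholders):

// Hypothetical usage of the class above.
$html   = CURLHandler::Get('http://www.example.com/');
$result = CURLHandler::Post('http://www.example.com/login.php',
                            array('username' => 'mungo', 'password' => 'lost-and-alone'));

// With the optional HTTP authentication argument:
$result = CURLHandler::Post('http://www.example.com/private/',
                            array('q' => 'test'),
                            array('username' => 'user', 'password' => 'pass'));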

[edit] I only just read the clarification... You probably want to go with one of the tools mentioned above that automate this kind of thing. You could also use a client-side Firefox extension like ChickenFoot for more flexibility. I'll leave the example class above here for future searches.

SchizoDuckie
Thanks for the example though; I think this sort of wrapper might be handy for other tasks. But yeah, I guess what I should have made clearer at first is that I want an automation type of thing.
davr
A: 

Curl.

Curl rocks :)

Check out my clarification... curl doesn't have the feature set I'm looking for.
davr
+1  A: 

I feel compelled to answer this, even though it's an old post... I've been working with PHP curl a lot, and it is nowhere near comparable to something like WWW::Mechanize, which I am switching to (I think I am going to go with the Ruby implementation). Curl feels outdated because it requires too much "grunt work" to automate anything. The SimpleTest scriptable browser looked promising to me, but in testing it, it won't work on most of the web forms I try it on... Honestly, I think PHP is lacking in this category of scraping and web automation, so it's best to look at a different language. I just wanted to post this since I have spent countless hours on this topic, and maybe it will save someone else some time in the future.

Rick