Hi!

I want to develop a simple web crawler that grabs pages from several websites and keeps them up to date. Some of these sites put a session ID on every link and don't store session IDs in cookies at all. So if I crawl a site several times, my table of parsed pages will contain duplicate pages that differ only in the session ID.

So my question is: how can I remove the session ID from all links? Is there some intelligent approach? I'm developing in PHP, but solutions for other platforms would be useful too, even just an algorithm described in words.

A: 

You can always use a regular expression to match session keys; most of the time they use a typical name (PHPSESSID). Anyway, if you're crawling something and would like to accept and work with cookies, you should use cURL (see the curl_setopt options CURLOPT_COOKIE, CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR).
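
To illustrate the cookie options, here is a minimal sketch of a cURL handle that keeps cookies between requests. The helper name `make_crawler_handle`, the example URL and the cookie file path are assumptions made for illustration. Some sites only append the session ID to links when the client does not send a cookie back, so a cookie-aware fetch may avoid the problem on those sites, though not on the ones described in the question.

```php
<?php
// Hypothetical helper: builds a cURL handle that reads and writes cookies
// from a single file, so a session cookie survives across requests.
function make_crawler_handle(string $url, string $cookieFile)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow redirects
        CURLOPT_COOKIEFILE     => $cookieFile, // cookies are read from this file
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are saved here when the handle is closed
    ]);
    return $ch;
}

// Assumed URL and path, for illustration only.
$cookieFile = sys_get_temp_dir() . '/crawler_cookies.txt';
$ch = make_crawler_handle('http://example.com/', $cookieFile);
// $html = curl_exec($ch); // uncomment to perform the request
```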

kovshenin
I've seen several unique session keys. Of course I can log these keys and use that information in later crawls, but it isn't a universal solution.
hippout
Maybe you can read it from the cookie HTTP headers. Anyway, @Hannes seems to have a better solution, since a 32-character session string is harder to change than the session key name, yet it's still not 100% reliable.
kovshenin
+1  A: 

As an example, if you want to use a regex, this would remove all session parameters from your URL (as long as they have 32 characters, which is the usual case, I guess):

// Removes any name=value pair whose value is exactly 32 word characters.
$url = preg_replace('#\w+=\w{32}\b#', '', $url);
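
The one-line pattern above leaves the neighbouring `?` or `&` behind once the parameter is gone. A possible extension that also tidies those separators is sketched below; the function name `strip_hash_like_params` is hypothetical, and the 32-character heuristic is still an assumption that will also match unrelated values such as MD5 file hashes.

```php
<?php
// Hypothetical helper: removes query parameters whose value is exactly
// 32 word characters, then cleans up the leftover separators.
function strip_hash_like_params(string $url): string
{
    // 1. Drop the name=value pair itself (must follow ? or & and end the value).
    $url = preg_replace('~(?<=[?&])\w+=\w{32}(?=&|#|$)~', '', $url);
    // 2. "?&a=1" becomes "?a=1".
    $url = preg_replace('~\?&+~', '?', $url);
    // 3. "a=1&&b=2" becomes "a=1&b=2".
    $url = preg_replace('~&{2,}~', '&', $url);
    // 4. Remove a trailing ? or & before a fragment or the end of the URL.
    $url = preg_replace('~[?&]+(?=#|$)~', '', $url);
    return $url;
}

$id = '0123456789abcdef0123456789abcdef'; // made-up 32-character ID
echo strip_hash_like_params("http://example.com/page.php?PHPSESSID=$id&id=5"), "\n";
```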

Hannes
It is as usual as the variable name: usual, but not always. See session.hash_bits_per_character. Though +1 for the regexp.
Col. Shrapnel
Apart from [session.hash_bits_per_character](http://php.net/session.configuration.php#ini.session.hash-bits-per-character) see also [session.hash_function](http://php.net/session.configuration.php#ini.session.hash-function). And this only covers standard PHP session IDs.
Gumbo
Yep, I think if someone really customizes them, it's nearly impossible to catch them.
Hannes
+2  A: 

You can use parse_str() and http_build_query() to extract, clear and rebuild the URL parameters. You can use regular expressions, but I think it would just be easier to get an array of the URL params to work with.

parse_str('session=123445&data=example&action=demo', $url_params);
// $url_params is now an associative array of the URL params
unset($url_params['session'], $url_params['action']);
$new_url_param_string = http_build_query($url_params);
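
Applied to a full URL rather than a bare query string, the same idea might look like the sketch below. The function name `remove_session_params` and the default list of parameter names are assumptions; a real crawler would probably need a per-site list.

```php
<?php
// Hypothetical helper: strips known session parameters from a URL's query string.
// The default name list is an assumption and is compared case-insensitively.
function remove_session_params(
    string $url,
    array $names = ['PHPSESSID', 'sid', 'session', 'sessionid']
): string {
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['query'])) {
        return $url; // malformed URL or no query string: nothing to clean
    }

    parse_str($parts['query'], $params);
    $blocked = array_map('strtolower', $names);
    foreach (array_keys($params) as $key) {
        if (in_array(strtolower((string) $key), $blocked, true)) {
            unset($params[$key]);
        }
    }

    // Rebuild: everything before the first "?", the cleaned query, then the fragment.
    $base     = strstr($url, '?', true);
    $query    = http_build_query($params);
    $fragment = isset($parts['fragment']) ? '#' . $parts['fragment'] : '';
    return $base . ($query !== '' ? '?' . $query : '') . $fragment;
}

echo remove_session_params('http://example.com/list.php?PHPSESSID=abc123&page=2'), "\n";
```

Note that parse_str() replaces dots and spaces in parameter names with underscores and http_build_query() re-encodes values, so the rebuilt URL may not be byte-for-byte identical to the original even when no parameter is removed.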
Brent Baisley