$regexp = '/(?:<input\stype="hidden"\sname="){1}([a-zA-Z0-9]*)(?:"\svalue="1"\s\/>)/';
$response = '<input type="hidden" name="7d37dddd0eb2c85b8d394ef36b35f54f" value="1" />';
preg_match($regexp, $response, $matches);

echo $matches[1]; // Outputs: 7d37dddd0eb2c85b8d394ef36b35f54f

So I'm using this regular expression to search for an authentication token on a webpage running Joomla in order to perform a scripted login.

I've got all this working, but I'm wondering what is wrong with my regular expression, as it always returns 2 items.

Array ( [0] => [1] => 7d37dddd0eb2c85b8d394ef36b35f54f)

Also, the name of the input I'm checking for changes on every page load, in both length and value.

A: 

As per the manual entry for preg_match:

If matches is provided, then it is filled with the results of search. $matches[0] will contain the text that matched the full pattern, $matches[1] will have the text that matched the first captured parenthesized subpattern, and so on.

Joshua Rodgers
And I guess the `[0] => [1]` is in fact (in the webpage source) `[0] => '<input type="hidden" name="7d37dddd0eb2c85b8d394ef36b35f54f" value="1" />' [1] =>`
Arkh
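
A minimal sketch of what the comment above describes, assuming the array was printed straight into a browser page. The sample markup and pattern are copied from the question; escaping the output with htmlspecialchars is my own addition for illustration, not something the original poster did.

```php
<?php
// Sample markup and pattern copied from the question.
$response = '<input type="hidden" name="7d37dddd0eb2c85b8d394ef36b35f54f" value="1" />';
$regexp = '/(?:<input\stype="hidden"\sname="){1}([a-zA-Z0-9]*)(?:"\svalue="1"\s\/>)/';

preg_match($regexp, $response, $matches);

// $matches[0] holds the whole <input ... /> tag. A browser renders that tag
// as an invisible hidden field, so element [0] only looks empty.
// Escaping the dump makes the tag visible as text.
$escaped = htmlspecialchars(print_r($matches, true));
echo $escaped;
```

Viewing the page source instead of the rendered page should show the same thing without any escaping.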
+3  A: 

Nothing is wrong. Item [0] always contains the entire match. From the docs (emphasis mine):

If matches is provided, then it is filled with the results of search. $matches[0] will contain the text that matched the full pattern, $matches[1] will have the text that matched the first captured parenthesized subpattern, and so on.

Your regex (overlooking the fact that you are working on HTML with regexes in the first place, which you know you shouldn't) is a bit too complicated.

$regexp = '#<input\s+type="hidden"\s+name="([0-9a-f]*)"\s+value="1"\s*/>#i';
  • You don't need the non-capturing groups at all.
  • You use \s, which matches exactly one whitespace character. \s+ is probably better.
  • Using a delimiter other than / makes escaping forward slashes in the regex unnecessary.
  • Making the regex case-insensitive could be useful, too.
  • The auth token looks like a hex string, so matching all of a-z is unnecessary; a-f is enough.
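
As a rough illustration of the points above, here is the simplified pattern run against two inputs. The first is the sample from the question; the second is a made-up variant (extra whitespace, uppercase letters, no space before the closing slash) that I am assuming could occur, just to exercise \s+, \s* and the i flag.

```php
<?php
$regexp = '#<input\s+type="hidden"\s+name="([0-9a-f]*)"\s+value="1"\s*/>#i';

$samples = [
    // From the question.
    '<input type="hidden" name="7d37dddd0eb2c85b8d394ef36b35f54f" value="1" />',
    // Hypothetical variant, not taken from a real Joomla page.
    '<INPUT  type="hidden"   name="7D37DDDD0EB2C85B" value="1"/>',
];

$tokens = [];
foreach ($samples as $html) {
    // preg_match returns 1 on a match, 0 on no match, false on error.
    if (preg_match($regexp, $html, $matches) === 1) {
        $tokens[] = $matches[1]; // the captured token
    }
}

print_r($tokens);
```

The original pattern from the question would reject the second input, since it allows exactly one whitespace character between attributes and is case-sensitive.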
Tomalak
Thank you. Your regex does seem easier to read, and I know you're not supposed to do HTML matching with regex, but this seemed like a great case for it.
Ballsacian1
@Ballsacian1: It's your funeral. ;-) Looking into DOMDocument::loadHTML and tackling this problem with DOM and XPath might be worthwhile nevertheless.
Tomalak
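
For reference, a minimal sketch of the DOM and XPath route mentioned in the last comment. The surrounding page markup is invented for the example; only the hidden input comes from the question. It also assumes the token field is the only hidden input with value="1", which may not hold on a real Joomla login page.

```php
<?php
// Hypothetical page fragment; only the hidden <input> is from the question.
$html = '<html><body><form>'
      . '<input type="text" name="username" />'
      . '<input type="hidden" name="7d37dddd0eb2c85b8d394ef36b35f54f" value="1" />'
      . '</form></body></html>';

// Collect libxml parse errors internally so that sloppy real-world HTML
// does not emit PHP warnings.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Assumption: the token is the name of the only hidden input with value="1".
$nodes = $xpath->query('//input[@type="hidden"][@value="1"]/@name');

// Take the first matching name attribute, or null if nothing was found.
$token = $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;
echo $token;
```

Unlike the regex, this does not depend on attribute order, spacing, or quoting style, which is the usual argument for parsing the HTML instead.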