I'm trying to set up some exotic PHP code (I'm not an expert), and I get a FastCGI Error 500 on a PHP line containing 'preg_match_all'.

When I comment out the line, the page is returned with a 200 (but it does not render as intended).

The code parses PHP, HTML and JavaScript content loaded from the database and composes them to return the finished page.

By adding some error_log entries, I determined that the line with the preg_match_all is the cause of the 500. However, the line is hit multiple times while the page loads, and on the other occasions it does not cause an error.

Here's exactly what it looks like:

preg_match_all ("/(<([\w]+)[^>]*>)((?:.|\n)*)(<\/\\2>)/",
                $part['data'], $tags, PREG_PATTERN_ORDER|PREG_OFFSET_CAPTURE);

The subject string is a piece of text that looks like:

<script> ... some javascript functions ... </script>

Edit: This is code that is up and running correctly elsewhere, so this very well could be a PHP setting or environment difference. I'm using PHP 5.2.13 on IIS6 with FastCGI.

Edit: Nothing is mentioned in the log files. At least not in the ones I checked:

  • IIS Logs
  • Event Logs
  • PHP Log

Edit: jab11 has pointed out the problem, but there's no solution yet.

Any thoughts or direction would be welcome.

+2  A: 

Any chance that $part['data'] might be extremely big? I used to get a 500 error on preg_match_all when I used it on strings bigger than 100 KB.

jab11
It is indeed pretty big. But it does not crash in another environment with the same PHP.INI setup. What is different: IIS6 vs IIS7, WinSrv2003 vs WinSrv2008. I'll have to double-check the PHP and FastCGI versions...
Bertvan
"Big" might be an exaggeration :) When I save the string as text, it's about 32 kB.
Bertvan
This is probably the right direction to look. The string is the biggest in the batch, and has a magic length of 32206. However, this still does not solve the problem... Could there be a setting or something that will fix this issue?
Bertvan
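On the question of a setting: PHP has had the pcre.backtrack_limit and pcre.recursion_limit ini directives and the preg_last_error() function since 5.2.0. A rough sketch of how they could be used to turn a hard crash into a reportable error is below. It assumes the 500 comes from PCRE running out of stack, which this thread has not confirmed; the limit of 1000 and the generated sample string are illustrative stand-ins, not values taken from the failing server.

```php
<?php
// Assumption: capping PCRE's recursion depth makes preg_* fail with an
// error code before the process can run out of stack.
// 1000 is an illustrative value, not a tuned one.
ini_set('pcre.recursion_limit', '1000');

// Hypothetical stand-in for $part['data']: roughly 32 kB of script text.
$data = "<script>" . str_repeat("var x = 1;\n", 3000) . "</script>";

// Same pattern and flags as in the question.
$count = preg_match_all("/(<([\w]+)[^>]*>)((?:.|\n)*)(<\/\\2>)/",
                        $data, $tags, PREG_PATTERN_ORDER | PREG_OFFSET_CAPTURE);

// preg_last_error() reports why the last preg_* call gave up, if it did,
// e.g. PREG_RECURSION_LIMIT_ERROR or PREG_BACKTRACK_LIMIT_ERROR.
$error = preg_last_error();
if ($error !== PREG_NO_ERROR) {
    error_log("preg_match_all failed, preg_last_error() = " . $error);
}
```

If the log shows a recursion or backtrack error instead of the request dying, that would support the size theory. Whether the limit trips at all depends on the PHP and PCRE build, so treat a clean run as inconclusive.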
+1  A: 

This is a wonderful example of why it's a bad idea to process HTML with regular expressions. I'm willing to bet you're running into a stack overflow because the HTML source string contains some unclosed tags, making the regex try all sorts of permutations in its futile attempt to find a closing tag (</\2>). In an HTML file of 32 KB, it's easy to throw your regex off the trolley. Perhaps the stack is a different size on a different server, so it works on one but not the other.

A quick test:

I applied the regex to the source code of this page (after having removed the closing </html> tag). RegexBuddy promptly went catatonic for about a minute before then matching the <head> and <body> tags (successfully). Debugging the regex from <html> on showed that it took the regex engine 970257 steps to find out that it couldn't match.

Tim Pietzcker
The way things are done in this codebase is indeed nowhere near best practice, or even a good idea. Most of it is really bad and unmaintainable. As I said in the question, it is an exotic codebase which I'm trying to set up in a test environment. So I did not choose to get this badly written PHP website running myself. My boss told me I have to do it; these things happen. But thanks for your answer, I'll go look at the content of what is being parsed now :)
Bertvan
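If the stack overflow theory above holds, one low-risk thing to try is replacing the (?:.|\n)* group with a plain .* and the s modifier, which lets . match newlines. As far as I know, PCRE builds of that era recurse once for every iteration of a repeated group but handle a single repeated . in a loop, so the rewritten pattern should need far less stack. That is an assumption about PCRE internals rather than something verified on the server in question. The sample string below is a hypothetical stand-in for $part['data'].

```php
<?php
// Hypothetical sample standing in for $part['data'].
$data = "<script>\nfunction f() { return 1; }\n</script>";

// Pattern from the question: (?:.|\n)* repeats a group once per character.
$original = "/(<([\w]+)[^>]*>)((?:.|\n)*)(<\/\\2>)/";

// Candidate replacement: with the s modifier, . also matches newlines,
// so a plain greedy .* covers the same text as (?:.|\n)*.
$candidate = '/(<([\w]+)[^>]*>)(.*)(<\/\2>)/s';

$flags = PREG_PATTERN_ORDER | PREG_OFFSET_CAPTURE;
preg_match_all($original, $data, $tagsOriginal, $flags);
preg_match_all($candidate, $data, $tagsCandidate, $flags);

// On well-formed input both patterns should capture the same groups.
if ($tagsOriginal === $tagsCandidate) {
    error_log("Both patterns produced identical captures");
}
```

The quantifier is kept greedy on purpose so the captures stay the same as with the original pattern. This does not make regex-based HTML parsing robust against unclosed tags; it only reduces how much stack each match attempt is likely to use.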