About your string regex: you say it is a string if and only if it is preceded
by a white space character or a ( and it is directly followed by a , or a ).
Needless to say, that is not correct. You'd miss strings like:
$s = "123"; // ends with a ;
$s = "ab\"cd"; // contains an escaped double quote
$t = 'efg' ; // is surrounded by single quotes
to name just three (there are many more, and what about 'here-docs'?).
To account for the cases above, try something like this:
$line = 's = "123"; t = "ab\\\\\\"cd"; u = \'efg\' ; v = \'ef\\\'g\' ';
echo $line . "\n";
echo preg_replace('/((["\'])(?:\\\\.|(?:(?!\2).|[^\\\\"\'\r\n]))*\2)/', '<span class="string">$1</span>', $line);
/* output:
s = "123"; t = "ab\\\"cd"; u = 'efg' ; v = 'ef\'g'
s = <span class="string">"123"</span>; t = <span class="string">"ab\\\"cd"</span>; u = <span class="string">'efg'</span> ; v = <span class="string">'ef\'g'</span>
*/
A short explanation:
( # start group 1
(["\']) # match a single- or double quote and store it in group 2
(?: # start non-capturing group 1
\\\\. # match a backslash followed by any character (except line breaks)
| # OR
(?: # start non-capturing group 2
(?!\2). # a character other than what is captured in group 2
| # OR
[^\\\\"\'\r\n] # any character except a backslash, double quote, single quote or line breaks
) # end non-capturing group 2
)* # end non-capturing group 1 and match it zero or more times
\2 # the quote captured in group 2
) # end group 1
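If it helps, the commented breakdown above can be used almost verbatim by switching on PCRE's x (extended) modifier, which ignores whitespace in the pattern and allows # comments. A minimal sketch, assuming a nowdoc so the extra layer of PHP string escaping disappears; the variable names and the sample input are only there for illustration:

```php
<?php
// The string pattern from above, in extended (x) mode. Inside a nowdoc a
// regex backslash is written as-is, without the doubling a quoted PHP string needs.
$pattern = <<<'REGEX'
/
(                     # start group 1
  (["'])              # a single or double quote, stored in group 2
  (?:                 # start non-capturing group 1
    \\.               # a backslash followed by any character (except line breaks)
    |                 # OR
    (?:               # start non-capturing group 2
      (?!\2).         # a character other than the quote in group 2
      |               # OR
      [^\\"'\r\n]     # any character except backslash, quotes and line breaks
    )                 # end non-capturing group 2
  )*                  # repeat group 1 zero or more times
  \2                  # the quote captured in group 2
)                     # end group 1
/x
REGEX;

// Sample input (illustrative only).
$line = <<<'TXT'
s = "123"; t = "ab\"cd"; u = 'efg' ;
TXT;

preg_match_all($pattern, $line, $matches);
print_r($matches[1]); // the three literals: "123", "ab\"cd" and 'efg'
```

One thing to watch for in x mode: the comments must not contain the pattern delimiter (here a forward slash), or PHP will end the pattern there.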
Then some comments about your second regex: you first try to match zero or more
white space characters. This can safely be omitted, because you'd still have a
match if no white space is present. You could use a \b (word boundary) before
matching the function name. Also, (?:[a-z]|[0-9]|_) can be replaced by the single
character class [a-z0-9_]. Now look at this part of your regex:
(@?|!?[a-z]+(?:[a-z]|[0-9]|_)*)
which is the same as:
(
@?
|
!?
[a-z]+
(?:
[a-z]
|
[0-9]
|
_
)*
)
only indented so you can see what it actually does. If you look closely, you will
see that the first alternative is just @? and since the ? makes the @ optional,
that alternative (and therefore the whole group) can match an empty string. It is
tried first, so the second alternative only gets a chance if the rest of the
pattern fails. Not what you'd expected, eh?
After that, I must confess I stopped looking at that regex: better to throw it
away.
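A quick way to see the empty match for yourself is to run that group on its own. This is only a sketch: the input string is made up, and the rest of your original regex is left out:

```php
<?php
// The group quoted above, used on its own as a pattern.
$pattern = '/(@?|!?[a-z]+(?:[a-z]|[0-9]|_)*)/';

preg_match($pattern, 'strlen($s)', $m, PREG_OFFSET_CAPTURE);

// The first alternative, @?, succeeds immediately without consuming anything,
// so the match is the empty string at offset 0 and "strlen" is never reached.
var_dump($m[0][0]); // string(0) ""
var_dump($m[0][1]); // int(0)
```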
Try something like this to match function names:
'/\b[a-z_][a-z0-9_]*(?=\s*\()/i'
Which means:
\b # a word boundary (the position between \w and \W)
[a-z_] # a letter or an underscore
[a-z0-9_]* # a letter, digit or an underscore, zero or more times
(?= # start positive look ahead
\s* # zero or more white space characters
\( # an opening parenthesis
) # end positive look ahead
This last one is not tested at all, I leave that for you. Also note that I know very little PHP, so I may be over-simplifying it, in which case it would help if you provide a couple of example code snippets you want to match as functions.
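To try the function-name pattern out, something along these lines should do. It is a small sketch, not a finished highlighter: the sample code string and the "function" class name are my own assumptions:

```php
<?php
$pattern = '/\b[a-z_][a-z0-9_]*(?=\s*\()/i';

// Made-up input: one call without and one with a space before the parenthesis.
$code = 'echo strlen($s) + my_func (1);';

// $0 refers to the whole match, i.e. the function name itself.
$html = preg_replace($pattern, '<span class="function">$0</span>', $code);

echo $html . "\n";
// echo <span class="function">strlen</span>($s) + <span class="function">my_func</span> (1);
```

Note that language constructs followed by a parenthesis, such as if ( or while (, would be matched as well, so you may want to check each match against a keyword list first.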
Furthermore, a word of caution: parsing code using regexes can be tricky, but if
you're only using it to highlight small snippets of code, you should be fine.
When the source files get larger, you might see a drop in performance. In that
case, consider making some parts of your regexes "possessive", which can
reduce the run time of your matching considerably (especially on larger source files).
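In PCRE (which PHP's preg_* functions use), a quantifier becomes possessive when you append a + to it. As a rough sketch, here is a reworked variant of the string pattern (not the exact regex from earlier) in which the loop can never consume the closing quote, so it is safe to forbid backtracking into it. The sample input is made up:

```php
<?php
// *+ is the possessive form of *: once the loop has consumed characters,
// the engine will not give them back to try other ways of matching.
$pattern = <<<'REGEX'
/(["'])(?:\\.|(?!\1)[^\\\r\n])*+\1/
REGEX;

// The last "string" is unterminated on purpose.
$line = <<<'TXT'
s = "ab\"cd"; u = 'efg'; broken = "no end
TXT;

preg_match_all($pattern, $line, $matches);
print_r($matches[0]); // the two terminated literals: "ab\"cd" and 'efg'
```

The gain shows up mostly on input like the unterminated string: with a plain * the engine would retry shorter and shorter versions of the loop before giving up, whereas the possessive loop fails at that position right away.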
Lastly, you're probably reinventing the wheel. There exist numerous (well tested) code-highlighters you can use. I suspect you already know this, but I thought it would still be worth mentioning.
FYI, I've had good experience with this one: http://shjs.sourceforge.net/doc/documentation.html
Good luck!