I need to let the user specify a custom format for a function that uses vsprintf(), and since PHP doesn't have glibc's register_printf_function(), I'll have to do it with PCRE.

My question is: what would be the best regular expression to match a % followed by any character, where that % is not itself preceded by another %, in a way that is usable programmatically afterwards?

The closest solution I could get was:

<?php

function myprintf($format,$args) {
 $matches = array();
 preg_match_all('/((?<!%)%*[^%]+)/', $format,$matches,PREG_OFFSET_CAPTURE|PREG_PATTERN_ORDER);
 print_r($matches);
}

myprintf("begin%a%%b%%%c%d",NULL);

This kind of works, but it gets confused by inputs like "%%%c". I would like each pair of %-signs (that is, an escaped %) to end up in its own group, like:

Array (
 0 => '%%',
 1 => '%c'
)

and not, as it does now, Array ( 0 => '%%%c' ). That is, I need to keep the input intact, though tokenized, so that I can join the pieces back together after processing the custom printf formats I find in the input.

Thanks,

Flavius

PS: the "user" is actually another programmer. I am aware of the security implications.

+1  A: 

If what you want is a % followed by any single character (a letter or another %), then you can simply do:

$string = "begin%a%%b%%%c%d";
preg_match_all("/%./", $string, $matches);
$values = $matches[0];

// $values = array(5) { [0]=> string(2) "%a" [1]=> string(2) "%%" [2]=> string(2) "%%" [3]=> string(2) "%c" [4]=> string(2) "%d" }

// begin %a %% b %% %c %d <- the same string with spaces added between the tokens.

Edit:

I think this is equivalent to what you want from the comments below:

preg_match_all('/(\s?\w+\s?|%[^%]|%%)/', $string, $matches);
$value = $matches[0];

// $value = array(7) { [0]=> string(5) "begin" [1]=> string(2) "%a" [2]=> string(2) "%%" [3]=> string(1) "b" [4]=> string(2) "%%" [5]=> string(2) "%c" [6]=> string(2) "%d" }

The main difference is that [2]=> string(3) "%%b" becomes [2]=> string(2) "%%" [3]=> string(1) "b", which should give you the same result because the %% would be evaluated as a single % anyway.
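
If the pieces also have to join back into the exact input, a single alternation that tries the escaped pair first may be enough. This is only a sketch: tokenize_format() is a made-up name, and it assumes a token is either "%%" plus the text after it, a single "%" plus the text after it, or a run of text containing no "%" at all.

```php
<?php

// Hypothetical helper (not part of the original answer): split a format
// string into tokens without dropping any character.
function tokenize_format(string $format): array
{
    // Order matters: the escaped pair '%%' must be tried before a single '%'.
    preg_match_all('/%%[^%]*|%[^%]*|[^%]+/', $format, $matches);
    return $matches[0];
}

$input  = "begin%a%%b%%%c%d";
$tokens = tokenize_format($input);
print_r($tokens); // begin, %a, %%b, %%, %c, %d

// Joining the tokens gives back the original string.
var_dump(implode('', $tokens) === $input); // bool(true)
```

Because every character is covered by one of the three alternatives, nothing (punctuation included) is skipped between matches, which is what makes the implode() round trip hold.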

null
A great idea which should fit my needs. I wonder, though, if there's a way to call preg_match_all() once and get the entire input tokenized correctly. So I'll vote it up, but not accept it yet.
Flavius
What I basically need to tell it could also be put as: "a % not preceded by more than one other %. If it's not preceded by one, match forward until a % is found. Otherwise start a new match."
Flavius
So what exactly would the output be, based on the example? Would it be array(3) { [0]=> string(2) "%a" [1]=> string(2) "%c" [2]=> string(2) "%d" }?
null
The input "begin%a%%b%%%c%d" should be tokenized as: 0 => begin, 1 => %a, 2 => %%b, 3 => %%, 4 => %c, 5 => %d
Flavius
So basically I don't care about the specifiers or modifiers and not even if they're valid, vsprintf() will take care of that. All I care about is parsing escaped %-characters correctly, and having <some-text> after an unescaped %-char as in "%<some-text>". That's all.
Flavius
It should be relatively simple for someone experienced at look-arounds, conditionals and backreferences.
Flavius
I've got even closer with /(?(?<!%)%{0,2}[^%]*|%%)/. Unfortunately this leaves out the "%c" after matching the double "%%" before it. Please tag my question as regex and pcre too!
Flavius
+1  A: 

Code:

$string = "begin%a%%b%%%c%d";
preg_match_all('/([^%]|%%)+|%.*?[a-zA-Z]/', $string, $matches);
print_r($matches[0]);

Output:

Array
(
    [0] => begin
    [1] => %a
    [2] => %%b%%
    [3] => %c
    [4] => %d
)

This should also parse compound format specifiers like %.3f or %1$d properly.
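
To connect this back to the original goal (custom conversions handed to vsprintf()), here is a rough sketch of one way it might be wired up. Everything about it is an assumption rather than part of the answer above: the custom_vsprintf() name, the $handlers map, the made-up %U conversion, and the restriction to sequential arguments (no positional specifiers, no letters used as custom padding characters). It rewrites each custom specifier to %s, keeping any flags and width, and pre-formats the matching argument before delegating to vsprintf().

```php
<?php

// Hypothetical wrapper: $handlers maps a custom conversion letter to a callable.
function custom_vsprintf(string $format, array $args, array $handlers): string
{
    $args = array_values($args);
    $i = 0; // index of the argument the next specifier consumes

    $rewritten = preg_replace_callback(
        '/%%|%[^%a-zA-Z]*([a-zA-Z])/',
        function (array $m) use (&$args, &$i, $handlers) {
            if ($m[0] === '%%') {
                return '%%'; // escaped percent sign: consumes no argument
            }
            $letter = $m[1];
            if (isset($handlers[$letter]) && array_key_exists($i, $args)) {
                // Custom conversion: format the argument now, print it as a string.
                $args[$i] = $handlers[$letter]($args[$i]);
                $i++;
                return substr($m[0], 0, -1) . 's';
            }
            $i++; // built-in conversion: leave it for vsprintf()
            return $m[0];
        },
        $format
    );

    return vsprintf($rewritten, $args);
}

echo custom_vsprintf('%s is %U (100%%)', ['status', 'ready'], ['U' => 'strtoupper']);
// prints: status is READY (100%)
```

Rewriting to %s rather than substituting the formatted value directly keeps a literal % inside an argument from being re-interpreted as a specifier.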

John Kugelman