Hello, I am creating a project, and I need to be able to use a regex (or something else, if that is preferable).

Basically, I need to convert a PHPish markup code page so that the "non-code" is converted into "code." For instance:

Original:

<?code
  echo '<html>';
?>
<head>
</head>
<body>
</body>
<?code
  echo '</html>';
?>

Converted:

<?code
  echo '<html>';
  echo '
<head>
</head>
<body>
</body>';
  echo '</html>';
?>

How could this work while also taking quotes into account (like `<?code $var='<?code stuff ?>';?>`)?

Also, it would help if someone could provide a way to detect included files, so that each include can be replaced with something that first preprocesses the file and then includes it. (The includes are similar to PHP's.)

Is this even possible with regex? I know you're not supposed to try to parse HTML with regex, but this isn't trying to parse it; it's really being quite dumb about the markup and everything else.

Also, this project (the preprocessor, that is) will actually be implemented in Ruby, so if Ruby has something that would aid in this, have at it.

I know the code looks very similar to PHP, and that's because it is. However, it will not be implemented in PHP and the "code" used won't actually be PHP, but it will use a `<?`-type mechanism for containing code in markup.

Edit: also note that the language inside the markup can for all practical purposes be Ruby. So it can contain quotes and comments that have the closing code tag.

A: 

More a couple of ideas than an answer:

I would suggest you try to find a regex that matches the blocks of PHP and then wrap everything else in echo statements, instead of the other way round.
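
A rough Ruby sketch of that idea, nothing more. The names `CODE_BLOCK`, `naive_convert` and `emit_echo` are made up for illustration, and the `<?code ... ?>` tags and `echo` statement are taken from the question's example.

```ruby
# Naive approach: every "<?code ... ?>" span is code, everything else is markup.
CODE_BLOCK = /<\?code(.*?)\?>/m

# Wrap a piece of markup in an echo statement, escaping backslashes and single quotes.
def emit_echo(text)
  escaped = text.gsub(/[\\']/) { |c| "\\#{c}" }
  "echo '#{escaped}';\n"
end

# Merge all code blocks into one block, turning the markup between them into echoes.
def naive_convert(source)
  out = "<?code\n".dup
  pos = 0
  source.scan(CODE_BLOCK) do
    m = Regexp.last_match
    markup = source[pos...m.begin(0)]
    out << emit_echo(markup) unless markup.strip.empty?
    out << m[1].strip << "\n"
    pos = m.end(0)
  end
  tail = source[pos..-1]
  out << emit_echo(tail) unless tail.strip.empty?
  out << "?>\n"
end

puts naive_convert("<?code\n  echo 'a';\n?>\n<b>\n<?code\n  echo 'c';\n?>\n")
```

Note that this splits at the first `?>` it sees, so a closing tag inside a string or comment (for example `<?code $var = '?>'; ?>`) cuts the code block short. The regex alone has no idea it is inside a quote.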

Another option may be to look at the PHP tokenizer, but I'm afraid I'm not sure how it deals with sections of HTML outside of the tags.

Jake Worrell
How about capturing this PHP block: `<?php echo 'no closing tag: ?>'; /* also no closing tag ?> */ ?>`
Bart Kiers
Hmm.. good point.. I guess it'll just have to be a hybrid parser.. Replacing all the markup appropriately and parsing everything in `<?php` to catch tricks like this.
Earlz
Fair point, perhaps the tokenizer might be worth looking into then.
Jake Worrell
Indeed, troelskn's answer is the way to go in my opinion.
Bart Kiers
A: 

You can use `token_get_all` to get a stream of parser tokens. Loop through them and echo them out; when you come upon a `T_INLINE_HTML` token, rewrite it as an echo statement instead.

Edit - Just saw you say you're using Ruby. Obviously, you can't use PHP's tokeniser from within Ruby. Maybe you can call php over the command line?

Edit 2:

"Is this even possible with regex? I know you're not supposed to try to parse HTML with regex, but this isn't trying to parse it; it's really being quite dumb about the markup and everything else."

It's parsing, all right. You can use a regexp to split your input into tokens (aka tokenization). Since most languages are contextual, you will then have to feed the tokens to a state machine, which can parse the code into an internal representation (an AST). This can then be transformed into your target output. It sounds elaborate and scary, but it's really quite simple once you have tried it a couple of times. I suggest that you work through it with the help of Wikipedia and Google.
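
To make those three stages concrete, here is a minimal Ruby sketch. Everything in it is an assumption for illustration: the names `TOKEN`, `tokenize`, `parse` and `render`, the `<?code ... ?>` tags and `echo` from the question, and an embedded language with only `'...'` and `"..."` strings with backslash escapes (comments are ignored here).

```ruby
# Stage 1: a regexp picks out only the delimiters whose meaning depends on context.
TOKEN = /<\?code|\?>|\\.|['"]/m

def tokenize(source)
  tokens = []
  pos = 0
  source.scan(TOKEN) do
    m = Regexp.last_match
    tokens << [:text, source[pos...m.begin(0)]] if m.begin(0) > pos
    tokens << [:delim, m[0]]
    pos = m.end(0)
  end
  tokens << [:text, source[pos..-1]] if pos < source.length
  tokens
end

# Stage 2: a state machine turns the tokens into a flat list of nodes,
# a very small stand-in for an AST: [[:code, "..."], [:markup, "..."], ...]
def parse(tokens)
  nodes = []
  state = :markup
  quote = nil
  buffer = "".dup
  tokens.each do |kind, text|
    case state
    when :markup
      if kind == :delim && text == "<?code"
        nodes << [:markup, buffer] unless buffer.empty?
        buffer = "".dup
        state = :code
      else
        buffer << text
      end
    when :code
      if kind == :delim && text == "?>"
        nodes << [:code, buffer.strip]
        buffer = "".dup
        state = :markup
      else
        buffer << text
        if kind == :delim && (text == "'" || text == '"')
          quote = text
          state = :string
        end
      end
    when :string
      # Tags and escaped quotes inside a string are just string content.
      buffer << text
      state = :code if kind == :delim && text == quote
    end
  end
  nodes << [state == :markup ? :markup : :code, buffer] unless buffer.empty?
  nodes
end

# Stage 3: transform the nodes into the target output (one merged code block).
def render(nodes)
  lines = nodes.map do |kind, text|
    kind == :code ? text : "echo '" + text.gsub(/[\\']/) { |c| "\\#{c}" } + "';"
  end
  "<?code\n" + lines.join("\n") + "\n?>"
end

puts render(parse(tokenize("<?code $v = '<?code stuff ?>'; ?><p>it's</p>")))
```

Because the state machine knows when it is inside a string, the `<?code stuff ?>` in the quoted value stays part of the code instead of ending the block.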

troelskn
Nah, that's not what I'm going for (and the actual code in the markup won't be PHP). Sorry, I changed my question to better reflect my intentions.
Earlz
Well, not what I was wanting.. but guess it's the answer :( (leave the question open a bit longer just in case though)
Earlz
Keep in mind that you don't need to write a parser that recognises the entire language. It's enough to tokenise into the parts whose context is relevant to what you're looking to manipulate. E.g. split by comment delimiters, string-literal delimiters, backslashes and the actual markers that you are searching for. That makes for a fairly simple state machine.
troelskn
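
For what it's worth, a state machine over just those delimiters could look roughly like the Ruby sketch below. The function name `split_segments` is hypothetical, and the delimiter set is an assumption: `<?code`/`?>` markers, `'...'` and `"..."` strings with backslash escapes, and `#` line comments (guessing at a Ruby-like embedded language, as the question's edit suggests).

```ruby
require "strscan"

# Split the source into [:markup, text] and [:code, text] segments,
# ignoring "?>" when it appears inside a string or a "#" comment.
def split_segments(source)
  scanner = StringScanner.new(source)
  segments = []
  state = :markup
  buffer = "".dup
  until scanner.eos?
    case state
    when :markup
      if scanner.scan(/<\?code/)
        segments << [:markup, buffer] unless buffer.empty?
        buffer = "".dup
        state = :code
      else
        buffer << scanner.getch
      end
    when :code
      if scanner.scan(/\?>/)
        segments << [:code, buffer.strip]
        buffer = "".dup
        state = :markup
      elsif (opener = scanner.scan(/['"#]/))
        buffer << opener
        state = { "'" => :single, '"' => :double, "#" => :comment }[opener]
      else
        buffer << scanner.getch
      end
    when :single, :double
      closer = state == :single ? "'" : '"'
      if scanner.scan(/\\./m)
        buffer << scanner.matched   # escaped character, never closes the string
      else
        ch = scanner.getch
        buffer << ch
        state = :code if ch == closer
      end
    when :comment
      ch = scanner.getch
      buffer << ch
      state = :code if ch == "\n"   # a line comment ends at the newline
    end
  end
  segments << [state == :markup ? :markup : :code, buffer] unless buffer.empty?
  segments
end

p split_segments("<?code x = 1 # not a close: ?>\nputs \"?>\" ?><b>")
```

Only five states are needed, and none of them has to understand the embedded language beyond where its strings and comments begin and end.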