I'm banging my head against the wall trying to figure out a (regexp?) based parser rule for the following problem. I'm developing a text markup parser similar to Textile (using PHP), but I don't know how to get the inline formatting rules right. I also noticed that the Textile parsers I found do not format the following text the way I would like:

-*deleted* -- text- and -more deleted text-

The result I want to have is:

<del><strong>deleted</strong> -- text</del> and <del>more deleted text</del>

What I do not want is:

<del><strong>deleted</strong> </del>- text- and <del>more deleted text</del>

Any ideas are much appreciated! Thanks very much!

UPDATE

I think I should have mentioned that '-' should still be a valid character (hyphen) :) -- for example, the following should be possible:

-american-football player-

expected result:

<del>american-football player</del>
+1  A: 

For a single token, you can simply match:

-((?:[^-]|--)*)-

and replace with:

<del>$1</del>

and similarly for \*((?:[^*]|\*{2,})*)\* and <strong>$1</strong>.

The regex is quite simple: a literal - at both ends. In the middle, in a capturing group, we allow anything that isn't a hyphen, or two hyphens in a row.

To also allow single dashes inside words, as in objective-c, you can additionally accept a dash that has a word character on each side:

-((?:[^-]|--|\b-\b)*)-
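
As a rough sketch of how the two patterns above might be combined in PHP: the function name `markup_inline` and the order of the two passes (`<del>` first, then `<strong>`) are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper (name and pass order are assumptions):
// applies the <del> pattern first, then the <strong> pattern.
function markup_inline(string $text): string
{
    // -...- becomes <del>...</del>; "--" and hyphens between two word
    // characters are consumed as ordinary content of the span.
    $text = preg_replace('/-((?:[^-]|--|\b-\b)*)-/', '<del>$1</del>', $text);

    // *...* becomes <strong>...</strong>; runs of two or more asterisks
    // are consumed as ordinary content of the span.
    $text = preg_replace('/\*((?:[^*]|\*{2,})*)\*/', '<strong>$1</strong>', $text);

    return $text;
}

echo markup_inline('-*deleted* -- text- and -more deleted text-'), "\n";
echo markup_inline('-american-football player-'), "\n";
```

One caveat with this sketch: because the capturing group may be empty, a stand-alone `--` outside any deleted span is itself matched as an empty `<del></del>`. Changing the inner `*` quantifier to `+` is one possible way around that.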
Kobi
Nice, I would upvote but I'm out of votes for today. =\
Alix Axel
OK, for this example it works -- but '-' should still be a valid character in the text. For example, "-objective-c-" should become "<del>objective-c</del>".
harald
@harald - well, you didn't mention you need it :)
Kobi
you are right :)
harald
A: 

You could try something like:

'/-.*?[^-]-\b/'

Where the ending hyphen must be at a word boundary and preceded by something that is not a hyphen.

Josiah
A: 

I think you should read this warning sign first: "You can't parse [X]HTML with regex".

Perhaps you should try googling for a PHP HTML library.

Sjuul Janssen
And perhaps you should stop giving pro forma answers and read the questions...
Alix Axel
A valid comment would be that you cannot match nested quotes, or in this case `*` and `-`, for example `- aa * bb - cc - bb * aa-`.
Kobi
+1  A: 

The strong tag is easy:

$string = preg_replace('~[*](.+?)[*]~', '<strong>$1</strong>',  $string);

Working on the others.


Shameless hack for the del tag:

$string = preg_replace('~-(.+?)-~', '<del>$1</del>', $string);
$string = str_replace('<del></del>', '--', $string);
Alix Axel
That'd be `str_replace('</del><del>', '--', $string);`. I guess that's the problem with hacks :)
Kobi
@Kobi: Oh! Didn't even notice that! Your solution is way better and the OP should use it. I had a very similar one but couldn't get the non-capturing group to work... I'm out of patience today - been awake for 22 hrs. :P
Alix Axel
+2  A: 

Based on the RedCloth library's parser description, with some modification for the double dash.

@
  (?<!\S)               # Start of string, or after space or newline
  -                     # Opening dash
  (                     # Capture group 1
    (?:                 #   : (see note 1)
      [^-\s]+           #   :
      [-\s]+            #   :
    )*?                 #   :
    [^-\s]+?            #   :
  )                     # End
  -                     # Closing dash
  (?![^\s!"\#$%&',\-./:;=?\\^`|~[\]()<])  # (see note 2)
@x
  • Note 1: This should match up to the next dash lazily, while consuming any non-single dashes, and single dashes surrounded by whitespace.
  • Note 2: Followed by space, punctuation, line break or end of string.

Or compacted:

@(?<!\S)-((?:[^-\s]+[-\s]+)*?[^-\s]+?)-(?![^\s!"#$%&',\-./:;=?\\^`|~[\]()<])@

A few examples:

$regex = '@(?<!\S)-((?:[^-\s]+[-\s]+)*?[^-\s]+?)-(?![^\s!"#$%&\',\-./:;=?\\\^`|~[\]()<])@';
$replacement = '<del>\1</del>';

echo preg_replace($regex, $replacement, '-*deleted* -- text- and -more deleted text-'), "\n";
echo preg_replace($regex, $replacement, '-*deleted*--text- and -more deleted text-'), "\n";
echo preg_replace($regex, $replacement, '-american-football player-'), "\n";

Will output:

<del>*deleted* -- text</del> and <del>more deleted text</del>
<del>*deleted*</del>-text- and <del>more deleted text</del>
<del>american-football player</del>

In the second example, it will match just -*deleted*-, since there are no spaces before the --. -text- will not be matched, because the initial - is not preceded by a space.
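
The commented form of the pattern can also be used directly in PHP. The following is only a sketch: the constant name `DEL_PATTERN` and the function name `strike_through` are made up for illustration, and the nowdoc string is an assumption chosen so that quotes and backslashes need no extra escaping.

```php
<?php
// Hypothetical constant name. A nowdoc keeps every backslash and quote
// literal, so the extended-mode pattern can be written without PHP escapes.
const DEL_PATTERN = <<<'REGEX'
@
  (?<!\S)               # start of string, or after whitespace
  -                     # opening dash
  (                     # capture group 1
    (?:
      [^-\s]+           #   a run of non-dash, non-space characters
      [-\s]+            #   followed by dashes and/or whitespace
    )*?                 #   repeated lazily
    [^-\s]+?            #   last run before the closing dash
  )
  -                     # closing dash
  (?![^\s!"\#$%&',\-./:;=?\\^`|~[\]()<])  # next: space, punctuation or end
@x
REGEX;

// Hypothetical wrapper around preg_replace for the <del> rule only.
function strike_through(string $text): string
{
    return preg_replace(DEL_PATTERN, '<del>$1</del>', $text);
}

echo strike_through('-american-football player-'), "\n";
```

The `x` modifier makes the engine ignore the layout whitespace and the `#` comments outside character classes, which is why the `#` inside the final character class is written as `\#`.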

MizardX