Hi,

I am trying to get a working regular expression to convert standard HTML code to a custom format (needed for data export).

For example, given the following code:

<a href="toto.php">Toto
</a> bwahaha
<td width="49%" bgcolor="#FF9E39" style="padding-left: 10px; padding-top: 3px; padding-bottom: 3px; border-bottom: 5px solid rgb(255, 255, 255);" class="texteblanc">
<a href="nuit-orientation.php" class="texteblanc">[strong]Nuit de l'orientation[/strong]</a>
</td>

I would like to extract the two links in the following format:

[a:toto.php]Toto[/a]
[a:nuit-orientation.php][strong]Nuit de l'orientation[/strong][/a]

And of course I want the links to be kept in place within the existing HTML code.

So I tried the following code:

$txt = preg_replace('/<a href="(([[:word:]]|[[:punct:]])+)"[^>]*>\n*(\r\n)*\r*(([[:print:]]|\r\n|\n)+)\n*(\r\n)*\r*<\/a>/i', '[a:${1}]${4}[/a]', $txt);

It works but not all the time...

Does anyone have an idea how to do something like this?

Thanks,

Damien

+2  A: 

Don't use regex to parse HTML!

Use an HTML parser.
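
As a rough illustration of the parser route, here is a minimal sketch using PHP's built-in DOMDocument. The helper name convertLinks() is made up for this example, and the code assumes the input is an HTML fragment in ASCII (or an encoding you handle separately). It rewrites each link as [a:href]...[/a] and leaves the surrounding markup in place, so treat it as a starting point rather than a finished solution.

```php
<?php
// Sketch only: convertLinks() is a hypothetical helper, not part of the original post.
// It rewrites each <a href="..."> in an HTML fragment as [a:href]...[/a]
// and leaves all other markup where it was.
function convertLinks(string $html): string
{
    $doc = new DOMDocument();

    // Real pages are rarely valid HTML, so silence parser warnings while loading.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment in a <div> so it has a single known container.
    // Assumption: ASCII input; non-ASCII text may need explicit charset handling.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // getElementsByTagName() returns a live list, so copy it before editing the tree.
    $anchors = iterator_to_array($doc->getElementsByTagName('a'));

    foreach ($anchors as $a) {
        if (!$a->hasAttribute('href')) {
            continue; // anchors without a target are left untouched
        }
        // Drop the line breaks around the link text (e.g. "Toto\n").
        if ($a->firstChild instanceof DOMText) {
            $a->firstChild->nodeValue = ltrim($a->firstChild->nodeValue);
        }
        if ($a->lastChild instanceof DOMText) {
            $a->lastChild->nodeValue = rtrim($a->lastChild->nodeValue);
        }

        $parent = $a->parentNode;
        $parent->insertBefore($doc->createTextNode('[a:' . $a->getAttribute('href') . ']'), $a);
        // Move the link's children out in order, so nested markup is preserved.
        while ($a->firstChild !== null) {
            $parent->insertBefore($a->firstChild, $a);
        }
        $parent->insertBefore($doc->createTextNode('[/a]'), $a);
        $parent->removeChild($a);
    }

    // Serialize only the contents of the wrapper <div>, not the added <html>/<body>.
    $wrapper = $doc->getElementsByTagName('body')->item(0)->firstChild;
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Unlike the pattern in the question, this does not care where href sits among the attributes or how the line breaks fall inside the link. The result can still be passed to strip_tags() afterwards if the remaining tags are not wanted.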

Skilldrick
If you use the dom you can get all the elements with attribute href="blah" very easily.
matpol
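
To make the point about href lookups concrete, here is a small sketch with DOMXPath. The function name linksWithHref() is invented for this example; it selects every link that has an href attribute and compares the value in PHP, which avoids having to quote the value inside the XPath expression.

```php
<?php
// Sketch only: linksWithHref() is a hypothetical helper for illustration.
// It returns the trimmed text of every <a> whose href equals $href.
function linksWithHref(string $html, string $href): array
{
    $doc = new DOMDocument();

    // Tolerate invalid markup without emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $texts = [];
    // '//a[@href]' matches any <a> element carrying an href attribute.
    foreach ($xpath->query('//a[@href]') as $a) {
        if ($a->getAttribute('href') === $href) {
            $texts[] = trim($a->textContent);
        }
    }
    return $texts;
}
```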
I need to export static text from PHP pages:
- The "design" is made of tables
- I need to export data from several designs that do not share the same layout
- All the needed data is written directly within the PHP
So: I replace all the tags I want to keep with a syntax like the one shown in my post, and I remove all the other tags with strip_tags(). The goal of all this is to make an XML export of the static pages to import them into a CMS (eZ Publish). I tried loading the page into a DOMDocument, but I will not be able to find where the data is except with a lot of exceptions...
MARTIN Damien
@MARTIN regardless of exactly which HTML manipulation system you use (DOM, SimpleXML, SimpleHTMLDOM, etc.), any of these will still be a *much better* solution than trying to solve this with regular expressions. Why? You *can't*: HTML **is not** a regular language.
Peter Bailey
OK, I will look for another way to extract the content of my page. Thanks for your answers.
MARTIN Damien