Hello,

I am trying to match HTML tags that might occur between words on a web page, using regexes.

For example, if the sentence that I want to match is "This is a word", I need to develop a pattern that will match something like "This is a <b>word</b>".

I've tried using the code below to prepare the regex pattern:

$pattern = "/".str_replace(" ", ".{0,100}", $sentence)."/si";

This replaces every space with .{0,100} and uses the s modifier so that the dot also matches newlines. However, I am getting undesired results with this.

Thanks in advance for any help with this!

A: 

Use the ereg_replace() or preg_replace() function when you need to perform a regular-expression search and replace.
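
As a rough illustration with the preg_* functions, here is a minimal sketch that builds the pattern so that the gap between two words may contain only whitespace and HTML tags, rather than any 100 characters. The helper name buildTagTolerantPattern() and the gap sub-pattern are my own assumptions, not something from the question, and the sketch does not try to handle every kind of markup.

```php
<?php
// Hypothetical helper (not from the original post): turns a plain sentence
// into a regex in which the words may be separated by whitespace and/or tags.
function buildTagTolerantPattern(string $sentence): string
{
    // Split the sentence into words and escape each one, so that regex
    // metacharacters in the sentence are treated literally.
    $words = array_map(
        function ($word) {
            return preg_quote($word, '~');
        },
        preg_split('/\s+/', trim($sentence))
    );

    // Assumed gap: one or more whitespace characters or complete tags.
    $gap = '(?:\s|<[^>]+>)+';

    // '~' is the delimiter, 'i' makes the match case-insensitive.
    return '~' . implode($gap, $words) . '~i';
}

$pattern = buildTagTolerantPattern('This is a word');
$found = preg_match($pattern, 'This is a <b>word</b>', $m);

echo $found ? $m[0] : 'no match', "\n";
```

The match starts at the first word and ends at the last word, so a closing tag after the last word is not part of it.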

chanchal1987
ereg_replace is depreciated. Better to stick to preg_replace.
Brad F Jacobs
What the hell? No one told me ereg_replace had gone down in value! ;)
TheDeadMedic
@TheDeadMedic, Sell Sell Sell!
Mike Sherov
A: 

I put this together very quickly, so it probably doesn't cover all edge cases, but I think it at least partially matches your requirements. Also, I haven't tried it in PHP.

/[^\s>]+[\s]*(<([^>]+)>)(.*)(</\2>)[\s]*[^\s<]+/g

In the following example:

<p>This is a <b><i>nice</i> sentence</b>.</p> <p>Here's another sentence.</p>

It only matches the first sentence, in the following groups:

  1. <b>
  2. b
  3. <i>nice</i> sentence
  4. </b>
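
Since the pattern above was not tried in PHP, here is a hedged sketch of how it might be run there. Two things are assumptions on my part: '~' is used as the delimiter so the slash in the closing tag needs no escaping, and the g flag is dropped because PCRE in PHP does not have one (preg_match_all() would be the way to collect every match).

```php
<?php
// Same pattern as above, with '~' as the delimiter and without the g flag.
// Single quotes keep \s and the \2 backreference intact for the regex engine.
$pattern = '~[^\s>]+\s*(<([^>]+)>)(.*)(</\2>)\s*[^\s<]+~';

$html = "<p>This is a <b><i>nice</i> sentence</b>.</p> "
      . "<p>Here's another sentence.</p>";

// preg_match() stops at the first match; $groups[0] is the whole match
// and $groups[1] to $groups[4] are the captured groups.
if (preg_match($pattern, $html, $groups)) {
    echo implode("\n", array_slice($groups, 1)), "\n";
} else {
    echo "no match\n";
}
```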
Paul Lammertsma
A: 

What are you actually trying to achieve? Parsing an HTML document with regex might not be the best solution. You can use XPath for what you've described (so far).
E.g. finding all rows in a table that contain the text "this is a word":

<?php
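// Parse the markup into a DOM tree; tags such as <b> become child nodes,
// so the text of a cell can be compared without them.
$doc = new DOMDocument;
$doc->loadHTML('<html><head><title>...</title></head><body>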
  <table>
    <tr><td>1</td><td>lalala</td></tr>
    <tr><td>2</td><td>this is a <b>word</b></td></tr>
    <tr><td>3</td><td>lalala</td></tr>
    <tr><td>4</td><td><b>And this is a</b> word, too</td></tr>
  </table>
</body></html>');

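$xpath = new DOMXPath($doc);
// contains(., "...") tests the string value of each <td>, i.e. its text with
// all markup stripped; note that the comparison is case-sensitive.
foreach($xpath->query('/html/body/table/tr[./td[contains(., "this is a word")]]') as $tr) {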
  foreach($tr->childNodes as $td) {
    echo $td->nodeValue, ' ';
  }
  echo "\n";
}

prints

2 this is a word 
4 And this is a word, too 
VolkerK