I would like to modify HTML like

I am <b>Sadi, novice</b> programmer.

to

I am <b>Sadi, learner</b> programmer.

To do this I will search using the string "novice programmer". How can I do it? Any ideas?

The search uses more than one word ("novice programmer"); it could be a whole sentence. Extra white space (e.g. new line, tab) and any tags must be ignored during the search, but the tags must be preserved during the replacement.

It is a sort of converter. It would be better if it were case-insensitive.

Thank you

Sadi


More clarification:

I have received some nice replies with possible solutions, but please keep posting if you have any ideas.

I would like to clarify the problem in case anyone missed it. The main post shows the problem as an example scenario.

1) The problem is to find and replace a string without considering the tags. Tags may show up within a single word. The string may contain multiple words. Tags only appear in the content string (the document); the search phrase never contains any tags.

We could easily remove all the tags and do a plain text operation, but then another problem shows up.

2) The tags must be preserved, even after replacing the text. That is what the example shows.

Thank you again for helping

A: 
dclowd9901
That would remove the HTML formatting completely, and the post was specifically about *keeping* HTML formatting.
Matti Virkkunen
Yeah, just noticed that. Sorry for the mixup.
dclowd9901
+1  A: 

Well, there might be a better way, but off the top of my head (assuming that tags won't appear in the middle of words, HTML is well-formed, etc.)...

Essentially, you'll need three things (sorry if this sounds patronising, not intended that way): 1. A method of sub-string matching that ignores tags. 2. A way of making the replacement preserving the tags. 3. A way of putting it all together.

1 - This is probably the most difficult bit. One method would be to iterate through all of the characters in the source string (strings are basically arrays of characters so you can access the characters as if they are array elements), attempting to match as many characters as possible from the search string, stopping when you've either matched all of the characters or run out of characters to match. Any characters between and including '<' and '>' should be ignored. Some pseudo-code (check this over, it's late and there may be mistakes):

findMatch(startingPos : integer, subject : string, searchString : string){
    //Variables for keeping track of characters matched and positions
    inTag = false;
    matchedCharacters = 0;
    matchStart = 0;

    //Iterate over the subject (the HTML), not the search string
    for(i from startingPos to length(subject) - 1){
        //Work out when entering or exiting tags, ignore tag contents
        if(subject[i] == '<'){
            inTag = true;
        }
        else if(subject[i] == '>'){
            inTag = false;
        }
        else if(!inTag){
            //Check if the character matches expected in search string
            if(subject[i] == searchString[matchedCharacters]){
                if(matchedCharacters == 0){
                    matchStart = i;
                }
                matchedCharacters++;

                //If all of the characters have been matched, return the start and end positions of the substring
                if(matchedCharacters == length(searchString)){
                    return matchStart, i;
                }
            }
            else if(matchedCharacters > 0){
                //Partial match failed: reset and retry from the character after matchStart
                i = matchStart;
                matchedCharacters = 0;
            }
        }
    }
    //If no full matches were found, return error
    return -1;
}
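
For reference, here is a rough, runnable PHP sketch of the same idea. The function name `findMatchIgnoringTags` and its return shape are my own assumptions, not part of the pseudo-code; it compares single bytes case-insensitively (ASCII only) and does not collapse white space.

```php
<?php
// Hypothetical helper: find $needle in $html while skipping everything between '<' and '>'.
// Returns [start, end] byte offsets into $html (both inclusive), or null when nothing matches.
function findMatchIgnoringTags(string $html, string $needle, int $startingPos = 0): ?array
{
    $needleLen = strlen($needle);
    if ($needleLen === 0) {
        return null;
    }
    $inTag = false;
    $matched = 0;
    $matchStart = 0;
    $len = strlen($html);
    for ($i = $startingPos; $i < $len; $i++) {
        $ch = $html[$i];
        if ($ch === '<') {
            $inTag = true;
        } elseif ($ch === '>') {
            $inTag = false;
        } elseif (!$inTag) {
            if (strcasecmp($ch, $needle[$matched]) === 0) {
                if ($matched === 0) {
                    $matchStart = $i;
                }
                $matched++;
                if ($matched === $needleLen) {
                    return [$matchStart, $i];
                }
            } elseif ($matched > 0) {
                // Partial match failed: retry from the character after its start.
                $i = $matchStart;
                $matched = 0;
            }
        }
    }
    return null;
}

$html = 'I am <b>Sadi, novice</b> programmer.';
[$start, $end] = findMatchIgnoringTags($html, 'novice programmer');
echo substr($html, $start, $end - $start + 1), "\n"; // novice</b> programmer
```

The returned offsets include any tags that sit inside the match, which is what step 2 needs as its "bit you want to work on".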

2 - Split the HTML source code into three strings - the bit you want to work on (between the two positions returned by the matching function) and the part before and after. Split up the bit you want to modify using, for example:

$parts = preg_split("/(<[^>]*>)/",$string, -1, PREG_SPLIT_DELIM_CAPTURE);

Keep a record of where the tags are, concatenate the non-tag segments and perform substring replace on this as normal, then split the modified string up again and reassemble with the tags in place.
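
A small sketch of what that split gives you, assuming the fragment from the example (the variable names `$fragment`, `$text` and `$tags` are just for illustration). It only shows the split, the tag bookkeeping and a lossless reassembly; how the replacement text is distributed around the tags is a policy decision left open here.

```php
<?php
$fragment = 'novice</b> programmer';

// With PREG_SPLIT_DELIM_CAPTURE, text lands at even indexes and tags at odd indexes.
$parts = preg_split('/(<[^>]*>)/', $fragment, -1, PREG_SPLIT_DELIM_CAPTURE);

$text = '';   // the non-tag segments, concatenated
$tags = [];   // offset in $text => list of tags that sat at that offset
foreach ($parts as $index => $part) {
    if ($index % 2 === 1) {
        $tags[strlen($text)][] = $part;
    } else {
        $text .= $part;
    }
}

print_r($parts);                   // novice, </b>, " programmer"
echo $text, "\n";                  // novice programmer
echo implode('', $parts), "\n";    // reassembles the original fragment unchanged
```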

3 - This is the easy part, just concatenate the modified part and the other two bits back together.

I may have horribly over complicated this mind, if so just ignore me.

Moonshield
Sadi
+3  A: 

I would do this:

if (preg_match('/(.*)novice((?:<.*>)?\s(?:<.*>)?programmer.*)/',$inString,$attributes)) {
  $inString = $attributes[1].'learner'.$attributes[2];
}

It should match any of the following:

novice programmer
novice</b> programmer
novice </b>programmer
novice<span> programmer

A text version of what the regex states would be something like: match any set of characters until you reach "novice" and put it into a capturing group. Then maybe match something that starts with a '<', has any number of characters after it and ends with '>' (but don't capture it). Then match exactly one white space character, and then maybe match another '<' ... '>' sequence (again not captured). That must be followed by "programmer" and any number of characters; everything after "novice" goes into the second capture group.

I would do some specific testing though, as I may have missed some stuff. Regex is a programmer's best friend!
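
If every occurrence has to be replaced, a variant with `preg_replace` might be easier to work with. This is only a sketch under my own assumptions: tags are matched with `<[^>]*>` instead of the greedy `<.*>`, any mix of white space and tags is accepted between the two words (so it would also accept tags with no white space at all), and the words are still hard-coded.

```php
<?php
$inString = 'I am <b>Sadi, novice</b> programmer. I am simple. '
          . 'I am <b>Sadi, novice</b> programmer. I am simple';

// Group 1 keeps whatever tags and white space sat between the two words.
$pattern = '/novice((?:\s|<[^>]*>)+)programmer/i';
$result = preg_replace($pattern, 'learner${1}programmer', $inString);

echo $result, "\n";
// I am <b>Sadi, learner</b> programmer. I am simple. I am <b>Sadi, learner</b> programmer. I am simple
```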

Kitson
It is very hard coded, but may be a possible solution, thank you
Sadi
One more thing: novice was also replaced, you just cannot see the effect because both words (search and replace) are the same, "novice".
Sadi
No, it isn't `preg_replace`... it is `preg_match`, it will only trigger if the pattern is matched and the capture groups are moved into $attributes and then reassembled into the desired string. As far as the hard coding, it was to give you what you were looking for, but regular expressions can be adapted to whatever you really need.
Kitson
"I am <b>Sadi, novice</b> programmer. I am simple. I am <b>Sadi, novice</b> programmer. I am simple" -- It does not work properly with this string, where the phrase occurs twice. I have tried with both preg_match_all and preg_match. And it never replaces "programmer"; it keeps it as it is. Any idea, please?
Sadi
A: 

Interesting problem.

I would use the DOM and XPath to find the closest nodes containing that text and then use substring matching to find out which bit of the string is in what node. That will involve character-per-character matching and possible backtracking, though.

Here is the first part, finding the container nodes:

<?php
error_reporting(E_ALL);
header('Content-Type: text/plain; charset=UTF-8');

$doc = new DOMDocument();
$doc->loadHTML(<<<EOD
<p>
    <span>
        <i>
            I am <b>Sadi, novice</b> programmer.
        </i>
    </span>
</p>
<ul>
    <li>
        <div>
            I am <em>Cornholio, novice</em> programmer of television shows.
        </div>
    </li>
</ul>
EOD
);
$xpath = new DOMXPath($doc);
// First, get a list of all nodes containing the text anywhere in their tree.
$nodeList = $xpath->evaluate('//*[contains(string(.), "programmer")]');
$deepestNodes = array();
// Now only keep the deepest nodes, because the XPath query will also return HTML, BODY, ...
foreach ($nodeList as $node) {
    $deepestNodes[] = $node;
    $ancestor = $node;
    while (($ancestor = $ancestor->parentNode) && ($ancestor instanceof DOMElement)) {
        $deepestNodes = array_filter($deepestNodes, function ($existingNode) use ($ancestor) {
            return ($ancestor !== $existingNode);
        });
    }
}
foreach ($deepestNodes as $node) {
    var_dump($node->tagName);
}
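
A possible starting point for the second part, as a sketch only: collect the text nodes under one container, remember where each one starts in the concatenated text, and see which nodes the phrase overlaps. The markup, the `//p` query and the names `$segments` and `$hits` are assumptions for the demo; offsets are byte offsets, and no replacement is done yet.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<p>I am <b>Sadi, novice</b> programmer.</p>');
$xpath = new DOMXPath($doc);
$container = $xpath->query('//p')->item(0);

// Concatenate all descendant text nodes and record the offset where each one starts.
$text = '';
$segments = array();
foreach ($xpath->query('.//text()', $container) as $textNode) {
    $segments[] = array('start' => strlen($text), 'node' => $textNode);
    $text .= $textNode->nodeValue;
}

$phrase = 'novice programmer';
$pos = stripos($text, $phrase);   // case-insensitive search in the tag-free text
$hits = array();
if ($pos !== false) {
    $end = $pos + strlen($phrase);
    foreach ($segments as $segment) {
        $segEnd = $segment['start'] + strlen($segment['node']->nodeValue);
        if ($segment['start'] < $end && $segEnd > $pos) {
            $hits[] = $segment['node']->nodeValue;   // this text node holds part of the phrase
        }
    }
}
var_dump($text, $pos, $hits);
```

Editing `nodeValue` on the overlapping text nodes would then change the text while leaving the element structure alone.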

I hope that helps you along.

janmoesen
"That will involve character-per-character matching and possible backtracking, though." Though it sounds good, it may not be a good solution for a production environment. But I will take a look at your solution. Thank you
Sadi
A: 

Since you didn't give exact specifics on what you will use this for, I will use your example of "I am sadi, novice programmer".

$before = 'I am <b>sadi, novice</b> programmer';
$after = preg_replace('/I am (<[^>]*>)?(.*), novice(<[^>]*>)? programmer/', 'I am $1$2, learner$3 programmer', $before);

Alternatively, for any text:

$string = '<b>Hello</b>, world!';
$orig = 'Hello';
$replace = 'Goodbye';
$pattern = "/(<[^>]*>)?$orig(<[^>]*>)?/";
$final = "$1$replace$2";
$result = preg_replace($pattern,$final,$string);
//$result should now be '<b>Goodbye</b>, world!'

Hope that helped. :d

Edit: An example of your example, with the second piece of code:

$string = 'I am <b>sadi, novice</b> programmer.';
$orig = 'novice';
$replace = 'learner';
$pattern = "/(<[^>]*>)?$orig(<[^>]*>)?/";
$final = "$1$replace$2";
$result = htmlspecialchars(preg_replace($pattern,$final,$string));
echo $result;

The only problem is if you were searching for something that was more than a word long.

Edit 2: Finally came up with a way to do it across multiple words. Here's the code:

function htmlreplace($string,$orig,$replace) 
 {
  $orig = explode(' ',$orig);
  $replace = explode(' ',$replace);
  $result = $string;
  while (count($orig)>0)
   {
    $shift = array_shift($orig);
    $rshift = array_shift($replace);

    $pattern = "/$shift\s?(<.*>)?/";
    $replacement = "$rshift$1";
    $result = preg_replace($pattern,$replacement,$result);
   }
  $result .= implode(' ',$replace);
  return $result;
 }

Have fun! :d

Hussain
Please look at the example. The search uses more than one word, "novice programmer". It could be a whole sentence. Extra white space (e.g. new line, tab) and any tags should be ignored during the search.
Sadi
Um, I don't think it's taking whitespace into consideration... Another fix is on the way, give me a few minutes.
Hussain
Not working properly. It works like replace-by-word, and even replace-by-word does not always work. Example: $inString = 'I am <b>Sadi, novice</b> programmer. I am simple. I am <b>Sadi, novice</b> programmer. I am simple programmer'; echo htmlreplace($inString, 'novice programmer', 'lame developer'); Result: I am Sadi, lame developer. I am simple. I am Sadi, novice developer. I am simple developer
Sadi
+3  A: 

ok i think this is what you want. it takes your input search and replace, splits them into arrays of strings delimited by space, generates a regexp that finds the input sentence with any number of whitespace/html tags, and replaces it with the replacement sentence with the same tags replaced between the words.

if the wordcount of the search sentence is higher than that of the replacement, it just uses spaces between any extra words, and if the replacement wordcount is higher than the search, it will add all 'orphaned' tags on the end. it also handles regexp chars in the find and replace.

<?php
function htmlFriendlySearchAndReplace($find, $replace, $subject) {
    $findWords = explode(" ", $find);
    $replaceWords = explode(" ", $replace);

    $findRegexp = "/";
    for ($i = 0; $i < count($findWords); $i++) {
        $findRegexp .= preg_replace("/([\\$\\^\\|\\.\\+\\*\\?\\(\\)\\[\\]\\{\\}\\\\\\-])/", "\\\\$1", $findWords[$i]);
        if ($i < count($findWords) - 1) {
            $findRegexp .= "(\s?(?:<[^>]*>)?\s(?:<[^>]*>)?)";
        }
    }
    $findRegexp .= "/i";

    $replaceRegexp = "";
    for ($i = 0; $i < count($findWords) || $i < count($replaceWords); $i++) {
        if ($i < count($replaceWords)) {
            $replaceRegexp .= str_replace("$", "\\$", $replaceWords[$i]);
        }
        if ($i < count($findWords) - 1) {
            $replaceRegexp .= "$" . ($i + 1);
        } else {
            if ($i < count($replaceWords) - 1) {
                $replaceRegexp .= " ";
            }
        }
    }

    return preg_replace($findRegexp, $replaceRegexp, $subject);
}
?>

here are the results of a few tests :

Original : <b>Novice Programmer</b>
Search : Novice Programmer
Replace : Advanced Programmer
Result : <b>Advanced Programmer</b>

Original : Hi, <b>Novice Programmer</b>
Search : Novice Programmer
Replace : Advanced Programmer
Result : Hi, <b>Advanced Programmer</b>

Original : I am not a <b>Novice</b> Programmer
Search : Novice Programmer
Replace : Advanced Programmer
Result : I am not a <b>Advanced</b> Programmer

Original : Novice <b>Programmer</b> in the house
Search : Novice Programmer
Replace : Advanced Programmer
Result : Advanced <b>Programmer</b> in the house

Original : <i>I am not a <b>Novice</b> Programmer</i>
Search : Novice Programmer
Replace : Advanced Programmer
Result : <i>I am not a <b>Advanced</b> Programmer</i>

Original : I am not a <b><i>Novice</i> Programmer</b> any more
Search : Novice Programmer
Replace : Advanced Programmer
Result : I am not a <b><i>Advanced</i> Programmer</b> any more

Original : I am not a <b><i>Novice</i></b> Programmer any more
Search : Novice Programmer
Replace : Advanced Programmer
Result : I am not a <b><i>Advanced</i></b> Programmer any more

Original : I am not a Novice<b> <i> </i></b> Programmer any more
Search : Novice Programmer
Replace : Advanced Programmer
Result : I am not a Advanced<b> <i> </i></b> Programmer any more

Original : I am not a Novice <b><i> </i></b> Programmer any more
Search : Novice Programmer
Replace : Advanced Programmer
Result : I am not a Advanced <b><i> </i></b> Programmer any more

Original : <i>I am a <b>Novice</b> Programmer</i> too, now
Search : Novice Programmer too
Replace : Advanced Programmer
Result : <i>I am a <b>Advanced</b> Programmer</i> , now

Original : <i>I am a <b>Novice</b> Programmer</i>, now
Search : Novice Programmer
Replace : Advanced Programmer Too
Result : <i>I am a <b>Advanced</b> Programmer Too</i>, now

Original : <i>I make <b>No money</b>, now</i>
Search : No money
Replace : Mucho$1 Dollar$
Result : <i>I make <b>Mucho$1 Dollar$</b>, now</i>

Original : <i>I like regexp, you can do [A-Z]</i>
Search : [A-Z]
Replace : [Z-A]
Result : <i>I like regexp, you can do [Z-A]</i>
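
One case the function above does not cover is a tag in the middle of a word (e.g. Novi<b>ce</b>). A rough sketch of one way to handle it, under my own assumptions: `replaceAllowingInnerTags` is a hypothetical helper, it allows tags between any two characters of the search phrase, and it moves every tag found inside the match to just after the replacement text. The markup stays balanced, but the formatting no longer wraps the same words. It is ASCII-only for case-insensitivity.

```php
<?php
// Hypothetical helper: match $find even when tags interrupt it, then emit
// $replace followed by all tags that were found inside the match.
function replaceAllowingInnerTags(string $find, string $replace, string $subject): string
{
    $find = preg_replace('/\s+/', ' ', trim($find));
    if ($find === '') {
        return $subject;
    }
    $tags = '(?:<[^>]*>)*';
    $pieces = [];
    foreach (str_split($find) as $ch) {
        // A space in the phrase matches one white space character,
        // then any further mix of white space and tags.
        $pieces[] = ($ch === ' ') ? '\s(?:\s|<[^>]*>)*' : preg_quote($ch, '/');
    }
    $pattern = '/' . implode($tags, $pieces) . '/i';

    $result = preg_replace_callback($pattern, function (array $m) use ($replace): string {
        preg_match_all('/<[^>]*>/', $m[0], $found);
        return $replace . implode('', $found[0]);
    }, $subject);

    return $result ?? $subject;
}

echo replaceAllowingInnerTags('novice programmer', 'learner programmer',
    'I am <b>Sadi, Novi<i>ce</i></b> programmer.'), "\n";
// I am <b>Sadi, learner programmer<i></i></b>.
```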
oedo
I like the solution, but there is a little bug. $inString = 'I am <b>Sadi, novice</b> programmer. I am simple. I am <b>Sadi, novice</b> programmer. I am simple'; echo htmlFriendlySearchAndReplace('Novice programmer', 'lame developer', $inString); Result is: I am Sadi, lame programmer. I am simple. I am Sadi, novice developer. I am simple
Sadi
sorry, edited answer to fix. change this line : `$findRegexp .= "(\s?(?:<[^>]*>)?\s(?:<[^>]*>)?)";`
oedo
Thank you, now it is working very well. The only remaining problem is that it does not work if there is a tag in the middle of a word, e.g. Novi<b>ce</b>. Of course that is quite difficult to solve, as we cannot easily determine the position of the tag. If you can, please post a solution for it.
Sadi
You may move the tag forward or backward :) Thank you very much for the solution. I have tried similar solution (as your function) but failed because I am bad with regex :(
Sadi
Urrghh!!!! I can not accept the answer :( The accept button has gone :( May be because of the bounty... But it is the best solution
Sadi
that's very strange. now the bounty has gone too? only 10 points for that answer then :(
oedo
you got 10 for my up vote... not the bounty... but your answer work out of the box.... :(
Sadi
oh well, no worries. glad i could help anyway.
oedo
Hi oedo, would you please help me here (http://stackoverflow.com/questions/2728288/split-string-into-smaller-part-with-constrain-php-regex-html) with your regex skills :)
Sadi
At last this answer was accepted by me somehow.... thanks again Oedo :)
Sadi