views:

4195

answers:

7

I want to write a regular expression that will replace the word Paris with a link, but only where the word is not already part of a link.

Example:

    i'm living <a href="Paris" atl="Paris link">in Paris</a>, near Paris <a href="gare">Gare du Nord</a>,  i love Paris.

would become

    i'm living.........near <a href="">Paris</a>..........i love <a href="">Paris</a>.
A: 

Regular expression:

!(<a.*</a>.*)*Paris!isU

Replacement:

$1<a href="Paris">Paris</a>

$1 refers to the first sub-pattern (at least in PHP). Depending on the language you use, the syntax could be slightly different.

This should replace all occurrences of "Paris" with the link in the replacement. It just checks whether all opening a tags were closed before "Paris".

PHP example:

<?php
$s = 'i\'m living <a href="Paris" atl="Paris link">in Paris</a>, near Paris <a href="gare">Gare du Nord</a>, i love Paris.'; 
$regex = '!(<a.*</a>.*)*Paris!isU'; 
$replace = '$1<a href="Paris">Paris</a>'; 
$result = preg_replace( $regex, $replace, $s); 
?>

Addition:

This is not the best solution. One situation where this regex won't work is when you have an img tag that is not within an a element. If the title attribute of that image is set to "Paris", this "Paris" will be replaced, too, and that's not what you want. Nevertheless, I see no way to solve your problem completely with a simple regular expression.

okoman
Are you sure about your '!' notation? Which dialect of regex are you using?
Jonathan Leffler
@Jonathan: in PHP, you can use any delimiter as long as it is the same at the beginning and the end. Useful to avoid escaping content... @okoman: I think you should not escape double quotes in a single-quoted string. And perhaps you can enhance the RE with a non-greedy match.
PhiLho
@Jonathan: I think that if I used a non-greedy match, it would not be clear that an a element must be closed (since the opening tags must occur as often as the closing ones). I used a regex evaluator (http://regexp-evaluator.de). It generated the quoted string, so it's not my fault ;-) Changing that...
okoman
@okoman: I tried your regex, but it does not match the Paris in 'near Paris'.
AnhTu
@AnhTu: Well, it does. It probably doesn't work for you because you aren't using an ungreedy regex. The 'U' at the end of the regex indicates that. I don't know which language you are trying to do this in, but make sure you use an ungreedy regex.
okoman
+3  A: 

The traditional answer to such questions: use a real HTML parser. REs aren't good at operating in a context, and HTML is complex: an 'a' tag can have attributes or not, in any order, can contain HTML inside the link or not, etc.
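
To make the parser suggestion concrete, here is a minimal sketch in Python using the standard library's html.parser. It is only an illustration under a few assumptions: the class name ParisLinker, the function link_paris and the empty href are made up for this example, the input is an HTML fragment (doctype declarations are not re-emitted), and end tags are written back in lowercase. It re-emits the input and only touches text that is outside any a element.

```python
import re
from html.parser import HTMLParser


class ParisLinker(HTMLParser):
    """Re-emit the input, wrapping 'Paris' in a link outside <a> elements."""

    def __init__(self):
        # convert_charrefs=False leaves entities such as &amp; untouched
        super().__init__(convert_charrefs=False)
        self.out = []
        self.link_depth = 0  # greater than 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1
        # Copy the tag exactly as written, attributes included
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied verbatim
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Only plain text outside links is rewritten
        if self.link_depth == 0:
            data = re.sub(r"\bParis\b", '<a href="">Paris</a>', data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def link_paris(html):
    parser = ParisLinker()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Because attribute values are never passed to handle_data, a "Paris" inside href or title is left alone, which is the case that is hard to get right with a single regex.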

PhiLho
+4  A: 

This is hard to do in one step. Writing a single regex that does that is virtually impossible.

Try a two-step approach.

  1. Put a link around every "Paris" there is, regardless of whether it is already inside another link.
  2. Find all incorrectly nested links (<a href="..."><a href="...">Paris</a></a>), and eliminate the inner link.

Regex for step one is dead-simple:

\bParis\b

Regex for step two is slightly more complex:

(<a[^>]+>(?:(?!</a>).)*?)<a[^>]+>(Paris)</a>

Use that one on the whole string and replace it with the content of match groups 1 and 2, effectively removing the surplus inner link.

Explanation of regex #2 in plain words:

  • Find every link (<a[^>]+>), optionally followed by any run of characters that does not contain a closing link ((?:(?!</a>).)*?). Save it into match group 1.
  • Now look for the next link (<a[^>]+>). Make sure it is there, but do not save it.
  • Now look for the word Paris. Save it into match group 2.
  • Look for a closing link (</a>). Make sure it is there, but don't save it.
  • Replace everything with the content of groups 1 and 2, thereby losing everything you did not save.

The approach assumes these side conditions:

  • Your input HTML is not horribly broken.
  • Your regex flavor supports non-greedy quantifiers (*?) and zero-width negative look-ahead assertions ((?!...)).
  • You wrap the word "Paris" only in a link in step 1, no additional characters. Every "Paris" becomes "<a href="...">Paris</a>", or step two will fail (until you change the second regex).
  • BTW: regex #2 explicitly allows for constructs like this: "<a href="">in the <b>capital of France</b>, <a href="">Paris</a></a>". The surplus link comes from step one, replacement result of step 2 will be "<a href="">in the <b>capital of France</b>, Paris</a>".
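
The two steps can be sketched in Python roughly as follows. This is an illustration, not a definitive implementation: the names LINK, STEP1, STEP2 and link_paris_two_step are made up, and step 2 uses a variant of regex #2 in which every character of group 1 is checked so that it does not start a "</a>". It also assumes that "Paris" never occurs inside a tag attribute, because step 1 would rewrite attribute values as well.

```python
import re

LINK = '<a href="">Paris</a>'  # placeholder href, as in the question

# Step 1: link every stand-alone "Paris", even inside existing links.
STEP1 = re.compile(r"\bParis\b")

# Step 2: an opening <a ...>, then text that never crosses a "</a>",
# then the surplus inner link that step 1 produced.
STEP2 = re.compile(r"(<a[^>]+>(?:(?!</a>).)*?)<a[^>]+>(Paris)</a>", re.S)


def link_paris_two_step(html):
    linked = STEP1.sub(LINK, html)
    # Keep groups 1 and 2, dropping the inner <a ...> and its </a>
    return STEP2.sub(r"\1\2", linked)
```

For example, 'I live <a href="/fr">in Paris</a>, near Paris.' keeps its existing link unchanged, while the second "Paris" gets a new one.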
Tomalak
A: 

If you aren't limited to regular expressions in this case, XSLT is a good choice of language for defining this replacement, because it 'understands' XML.

You define two templates: One template finds links and removes those links that don't have "Paris" as the body text. Another template finds everything else, splits it into words and adds tags.

Tom Leys
A: 

Regexes don't replace. Languages do.

Languages and libraries would also read from the database or file that holds the list of words you care about, and associate a URL with each name. Here's the easiest substitution I can imagine doing with a single regex (Perl is used for the replacement syntax):

s/([a-z-']+)/<a href="http:\/\/en.wikipedia.org\/wiki\/$1">$1<\/a>/i

Proper names might work better:

s/([A-Z][a-z-']+)/<a href="http:\/\/en.wikipedia.org\/wiki\/$1">$1<\/a>/gi;

Of course "Baton Rouge" would become two links:

<a href="http://en.wikipedia.org/wiki/Baton">Baton</a> 
<a href="http://en.wikipedia.org/wiki/Rouge">Rouge</a>

In Perl, you can do this:

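# Longest names first, so a longer name wins over any shorter name it starts with
my $barred_list_of_cities 
    = join( '|'
    , map  { quotemeta($_) }
      sort { ( length($b) <=> length($a) ) || ( $a cmp $b ) } keys %url_for_city_of
    );
s/($barred_list_of_cities)/<a href="$url_for_city_of{$1}">$1<\/a>/g;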

But again, it's the language that implements a set of operations for regexes; regexes by themselves don't do a thing. (In reality, this is such a common application that I'd be surprised if there isn't a CPAN module out there somewhere that does this, so that you just need to load the hash.)
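
The same lookup-table idea, sketched in Python for comparison. The table contents, the URLs and the names URL_FOR_CITY, CITY and link_cities are hypothetical and only serve the illustration; sorting the longest names first is an assumption made here so that "Baton Rouge" is tried before "Baton". Like the Perl version, this does not check whether a name is already inside a link.

```python
import re

# Hypothetical lookup table; the URLs are illustrative only.
URL_FOR_CITY = {
    "Paris": "http://en.wikipedia.org/wiki/Paris",
    "Baton": "http://en.wikipedia.org/wiki/Baton",
    "Baton Rouge": "http://en.wikipedia.org/wiki/Baton_Rouge",
}

# Longest names first, then alphabetical, so longer names win.
names = sorted(URL_FOR_CITY, key=lambda name: (-len(name), name))

# re.escape guards against names that contain regex metacharacters.
CITY = re.compile("|".join(re.escape(name) for name in names))


def link_cities(text):
    def to_link(match):
        name = match.group(0)
        return f'<a href="{URL_FOR_CITY[name]}">{name}</a>'

    return CITY.sub(to_link, text)
```

The regex only finds the names; the surrounding code decides what each match is replaced with, which is the point made above.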

Axeman
+4  A: 

You could search for this regular expression:

(<a[^>]*>.*?</a>)|Paris

This regex matches a link, which it captures into the first (and only) capturing group, or the word Paris.

Replace the match with your link only if the capturing group did not match anything.

E.g. in C#:

resultString = 
    Regex.Replace(
        subjectString, 
        "(<a[^>]*>.*?</a>)|Paris", 
        new MatchEvaluator(ComputeReplacement));

public String ComputeReplacement(Match m) {
    // Group 1 participates only when an existing link was matched
    if (m.Groups[1].Success) {
        return m.Groups[1].Value;
    } else {
        return "<a href=\"link to paris\">Paris</a>";
    }
}
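
For readers who do not use C#, the same technique can be sketched in Python with re.sub and a replacement function. The names LINK_OR_PARIS, compute_replacement and link_paris_outside_links are made up for this illustration; the pattern and the replacement text are the ones given above.

```python
import re

# Either a whole existing link (captured in group 1) or the bare word.
LINK_OR_PARIS = re.compile(r"(<a[^>]*>.*?</a>)|Paris")


def compute_replacement(match):
    # Group 1 is None when the bare word matched instead of a link
    if match.group(1) is not None:
        return match.group(1)  # put the existing link back unchanged
    return '<a href="link to paris">Paris</a>'


def link_paris_outside_links(html):
    return LINK_OR_PARIS.sub(compute_replacement, html)
```

Existing links are consumed whole by the first alternative, so any "Paris" inside them, including one in an href attribute of an a tag, never reaches the second alternative.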
Jan Goyvaerts
A: 
  $pattern = 'Paris';
  $text = 'i\'m living <a href="Paris" atl="Paris link">in Paris</a>,  near Paris <a href="gare">Gare du Nord</a>,  i love Paris.';

  // 1. Collect two parallel arrays:
  //  $matches[1] - existing links that contain the keyword ('' where the bare keyword matched)
  //  $matches[2] - bare keyword occurrences ('' where a link matched)
  preg_match_all('@(<a[^>]*?>[^<]*?'.$pattern.'[^<]*?</a>)|(?<!\pL)('.$pattern.')(?!\pL)@', $text, $matches);

  // Is there at least one keyword outside an <a> tag? Find the first one.
  $number = array_search($pattern, $matches[2]);

  // At least one bare keyword exists, so do the replacement
  if ($number !== FALSE) {

    // Swap each existing link for a temporary placeholder
    foreach ($matches[1] as $k => $tag) {
      if ($tag === '') continue; // this match was a bare keyword, not a link
      // The trailing '_' keeps placeholder 1 from being a prefix of placeholder 10
      $text = preg_replace('@(<a[^>]*?>[^<]*?'.$pattern.'[^<]*?</a>)@', 'KEYWORD_IS_ALREADY_LINK_'.$k.'_', $text, 1);
    }

    // Wrap the remaining bare keywords in a link
    $text = preg_replace('/(?<!\pL)('.$pattern.')(?!\pL)/', '<a href="">'.$pattern.'</a>', $text);

    // Put the original links back
    foreach ($matches[1] as $k => $tag) {
      if ($tag === '') continue; // nothing was swapped out for this match
      $text = str_replace('KEYWORD_IS_ALREADY_LINK_'.$k.'_', $tag, $text);
    }

    // Print the result
    echo $text;
  }
faost