I have many HTML documents, and I need to replace the text "foo" with "bar" in all of them, except in link URLs.

For example

foo<a href="foo.com">foo</a>

should be replaced with

bar<a href="foo.com">bar</a>

The URL in the link (foo.com) should be left untouched.

The same applies to image links and links to JavaScript files or stylesheets: only the text should be replaced, and the URLs should stay unchanged.

Any ideas for a nice regex or something? :)

I can use Ruby too :)

A: 

I'd recommend using Hpricot, which lets you operate on the inner_html of elements only, so attributes such as href stay untouched. You'll need something more than a regex to get what you want.
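To make the idea concrete, here is a minimal sketch of replacing text in text nodes only. It is not Hpricot code: it swaps in REXML, which is bundled with standard Ruby installs, so it assumes the input is well-formed XML/XHTML. Real-world HTML would need a lenient parser such as Hpricot. The helper name `replace_text` and the `SKIP_PARENTS` list are made up for illustration.

```ruby
require "rexml/document"

# Elements whose text should never be rewritten (an assumption for this sketch).
SKIP_PARENTS = %w[script style].freeze

# Replace `from` with `to` in text nodes only. Attributes (href, src, ...)
# and comments are different node types, so //text() never selects them.
def replace_text(markup, from, to)
  doc = REXML::Document.new(markup)
  REXML::XPath.each(doc, "//text()") do |node|
    # Leave inline JavaScript and CSS alone.
    next if SKIP_PARENTS.include?(node.parent.name)
    node.value = node.value.gsub(from, to)
  end
  doc.to_s
end

puts replace_text('<p>foo<a href="foo.com">foo</a><script>foo</script></p>', "foo", "bar")
```

Because the URL lives in an attribute rather than a text node, `foo.com` needs no special-casing. Note that REXML may serialize attribute values with different quote characters than the input used.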

Peter
Good idea, it works! Thanks :)
astropanic
A: 

Regular expressions cannot parse HTML. Use a tool such as XSLT that's up to the job:

<?xml version="1.0"?>

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Identity transform: copy every node and attribute unchanged by default -->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>

  <!-- Override for text nodes, except those directly inside a script element -->
  <xsl:template match="//text()[name(..) != 'script']">
    <xsl:call-template name="replace-foo" />
  </xsl:template>

  <!-- XSLT 1.0 has no replace(), so substitute 'foo' recursively -->
  <xsl:template name="replace-foo">
    <xsl:param name="text" select="." />
    <xsl:choose>
      <xsl:when test="contains($text, 'foo')">
        <xsl:value-of select="substring-before($text, 'foo')"/>
        <xsl:text>bar</xsl:text>
        <xsl:call-template name="replace-foo">
          <xsl:with-param name="text" select="substring-after($text, 'foo')"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$text"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>

With the following input:

<html>
<head><title>Yo!</title></head>
<body>
<!-- foo -->
foo<a href="foo.com">foo</a>
<script>foo</script>
</body>
</html>

you'd get:

$ xsltproc replace-foo.xsl input.html
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Yo!</title>
</head>
<body>
<!-- foo -->
bar<a href="foo.com">bar</a>
<script>foo</script>
</body>
</html>
Greg Bacon