I'm trying to convert from HTML to LaTeX, and want to change this:

<a href="www.foo.com/bar">baz</a>

into:

baz\footnote{www.foo.com/bar}

I'd like to write a Clojure function that takes a chunk of text and replaces every match in a given paragraph.

I've tried

(.replaceAll 
    "<a href=\"foo.com\">baz</a>" 
    "<a.*href=\"(.*)\">(.*)</a>" 
    "\2\\footnote{\1}")

but that returns:

"^Bfootnote{^A}"

I've also looked at clojure.contrib.str-utils2, which has a replace function that uses regular expressions, but it doesn't seem to handle backreferences. Am I missing something? Going about this the wrong way? Any help is appreciated.

+4  A: 

(You should not parse HTML with a regex...)

Two things:

  1. Java uses $1, $2 to refer to capture groups, not \1, \2.

  2. You need more backslashes in the replacement text. The first level of backslashing is consumed by the Clojure reader because it's a literal string. The second level is consumed by Java's regex replacement processing, which treats a backslash in the replacement string as an escape character. Unfortunately Clojure doesn't have a general syntax for "raw" String literals (yet?). The Clojure literal regex syntax #"" does some magic to save you some backslashes, but normal Strings don't have that magic.

So:

user> (.replaceAll "<a href=\"www.foo.com/bar\">baz</a>"
                   "<a.*href=\"(.*)\">(.*)</a>"
                   "$2\\\\footnote{$1}")
"baz\\footnote{www.foo.com/bar}"
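Note that the REPL prints the result in reader syntax, so the backslash shows up doubled even though the string holds only one. Here is a minimal sketch of the escaping layers; the names html, pattern, result and quoted-result are just for illustration. It also shows java.util.regex.Matcher/quoteReplacement, which may be easier to read when only part of the replacement is literal text.

```clojure
(import 'java.util.regex.Matcher)

(def html "<a href=\"www.foo.com/bar\">baz</a>")
(def pattern "<a.*href=\"(.*)\">(.*)</a>")

;; Four typed backslashes: the reader keeps two characters,
;; and the replacement processing turns those two into one.
(def result (.replaceAll html pattern "$2\\\\footnote{$1}"))

(println result)          ; prints baz\footnote{www.foo.com/bar}
(println (count "\\\\"))  ; prints 2: the reader already halved the backslashes

;; quoteReplacement escapes a literal fragment for the replacement step,
;; so that fragment only needs reader-level escaping.
(def quoted-result
  (.replaceAll html pattern
               (str "$2" (Matcher/quoteReplacement "\\footnote{") "$1}")))
```

Both result and quoted-result should hold the same string, with a single backslash before footnote.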

You can also do it this way:

user> (require '(clojure.contrib [str-utils2 :as s]))
nil
user> (s/replace "<a href=\"www.foo.com/bar\">baz</a>"
                 #"<a.*href=\"(.*)\">(.*)</a>"
                 (fn [[_ url txt]]
                     (str txt "\\\\footnote{" url "}")))
"baz\\footnote{www.foo.com/bar}"
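Since the question asks about replacing every link in a paragraph, one caveat: the greedy .* in the patterns above will span from the first link to the last one when a paragraph contains several, so only one footnote comes out. Below is a rough sketch of a tighter pattern using clojure.string, which ships with current Clojure versions. The function name links->footnotes and the sample paragraph are my own. It assumes double-quoted href attributes, and a Clojure version recent enough (1.5 or later, as far as I know) that clojure.string/replace uses the function's return value literally, so only reader-level escaping is needed.

```clojure
(require '[clojure.string :as string])

;; Hypothetical sample input with two links.
(def paragraph
  (str "See <a href=\"www.foo.com/bar\">baz</a> and "
       "<a href=\"www.example.com\">qux</a>."))

(defn links->footnotes
  "Replaces each <a href=\"...\">text</a> in text with text\\footnote{url}."
  [text]
  (string/replace text
                  ;; [^\"]* and the lazy .*? keep each match inside a single link.
                  #"<a\s[^>]*?href=\"([^\"]*)\"[^>]*>(.*?)</a>"
                  (fn [[_ url txt]]
                    (str txt "\\footnote{" url "}"))))

(println (links->footnotes paragraph))
;; prints See baz\footnote{www.foo.com/bar} and qux\footnote{www.example.com}.
```

This is still regex-on-HTML, so nested tags, single-quoted attributes, or links broken across lines will not be handled.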

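In the original attempt, "\2" in a Clojure string literal is read as an octal escape for the control character with code 2, which is why it's displayed as ^B. It's the same character you get from (char 2). Likewise "\1" becomes ^A.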
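A quick way to check what the reader made of the original replacement string is to look at its character codes. This is only a small sketch; bad-replacement and bad-result are names I made up.

```clojure
;; The replacement string from the question, as the Clojure reader sees it.
(def bad-replacement "\2\\footnote{\1}")

;; The first code is 2 and the second-to-last is 1: both are control characters.
(println (map int bad-replacement))

;; The remaining backslash then escapes the f, so \footnote loses its backslash too.
(def bad-result
  (.replaceAll "<a href=\"foo.com\">baz</a>"
               "<a.*href=\"(.*)\">(.*)</a>"
               bad-replacement))
```

bad-result ends up as the character 2, then footnote{, then the character 1, then }, which matches the ^Bfootnote{^A} output in the question.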

Brian Carper
Is there a reason to choose the .replaceAll over the s/replace option, or vice versa? It seems they both should work, but does one have higher processing requirements, or is one more idiomatic Clojure? Given equal functionality, what's the best practice?
Andrew Larned
`clojure.contrib.str-utils2/replace` does more (you can pass in an fn as the third argument), but it's an added dependency for your project. It's idiomatic to use either; you don't have to shy away from making Java calls. Personally I use `str-utils` for most things.
Brian Carper
+1  A: 

And if you want to be really spiffy, you go for clojure.xml. It will return a tree of structures you can modify as you like. Your above example would look like this:

{:tag :a :attrs {:href "www.foo.com/bar"} :content ["baz"]}

This can be easily translated to something like:

["baz" {:footnote "www.foo.com/bar"}]

which can be easily serialised back to your desired form. And the best part: No unmaintainable regexes. :) YMMV of course.....
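For what it's worth, here is a rough sketch of that approach. It assumes the input is well-formed XML (real-world HTML often is not, in which case clojure.xml/parse will throw), and it goes straight from the parsed tree to LaTeX text rather than through the intermediate vector form. The names parse-xml-string, node->latex and paragraph-xml are mine.

```clojure
(require '[clojure.xml :as xml])
(import 'java.io.ByteArrayInputStream)

(defn parse-xml-string
  "Parses a well-formed XML string into clojure.xml's nested element maps."
  [^String s]
  (xml/parse (ByteArrayInputStream. (.getBytes s "UTF-8"))))

(defn node->latex
  "Walks the tree: text is kept as-is, :a elements get a footnote appended."
  [node]
  (cond
    (string? node)
    node

    (= :a (:tag node))
    (str (apply str (map node->latex (:content node)))
         "\\footnote{" (get-in node [:attrs :href]) "}")

    :else
    (apply str (map node->latex (:content node)))))

;; Hypothetical sample input, wrapped in <p> so there is a single root element.
(def paragraph-xml
  (str "<p>See <a href=\"www.foo.com/bar\">baz</a> and "
       "<a href=\"www.example.com\">qux</a>.</p>"))

(println (node->latex (parse-xml-string paragraph-xml)))
```

Other tags fall through to the :else branch and just contribute their text, so extending this to more elements is a matter of adding cond clauses.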

kotarak