I'm writing a complex script that takes the XML backup of a Blogger blog and converts it to InDesign Tagged Text to be laid out in a book. I'm using a whole bunch of regular expressions to clean out the HTML tags of each blog post and convert them to InDesign tags. For example:
<p>A really long paragraph.</p> -> <ParaStyle:Main text>A really long paragraph.
<em>Whatever</em> -> <CharStyle:Italic>Whatever<CharStyle:>
For the most part the script is working great. However, InDesign can't handle nested tags. <CharStyle:Small><CharStyle:Italic>This is small italic text<CharStyle:><CharStyle:>
will not work and needs to end up as <CharStyle:Small italic>This is small italic text<CharStyle:>
I'm trying to use variables in regex search patterns to find anywhere where character style tags are doubled up, but when I use the variables, nothing is found. If I hard code the InDesign tags into the regex, though, it works. What is making the variables unfindable?
Here's a working excerpt from my code (in real life $input
isn't a string variable, but a LibXML object that the script parses...this is just for illustration)
#!/usr/bin/perl -w
use strict;
my $IDitalic = "<~~CharStyle:Italic>";
my $IDsmall = "<~~CharStyle:Small>";
my $IDsmallitalic = "<~~CharStyle:Small italic>";
my $IDcharend = "<~~CharStyle:>";
sub cleanText {
my $text = $_[0];
# Replace any span with a font size attribute with "small" character style
$text =~ s/<span[^>]*?font-size[^>]*>(.*?)<\/span>/$IDsmall$1$IDcharend/gis;
# Replace <em> tags with "italic" character style
$text =~ s/<em>(.*?)<\/em>/$IDitalic$1$IDcharend/gis;
#--------------------------------------------------------
# Problem section
#
# The following works since everything is hard coded
# $text =~ s/<~~CharStyle:Small><~~CharStyle:Italic>/$IDsmallitalic/gi;
# $text =~ s/<~~CharStyle:><~~CharStyle:>/$IDcharend/gi;
# When I use variables, though, it doesn't work...
$text =~ s/{$IDsmall}{$IDitalic}/$IDsmallitalic/gi;
$text =~ s/({$IDcharend})\1+/$1/gi;
#--------------------------------------------------------
# Clear out all tags that aren't the InDesign tags, take out the dummy ~~ and rebuild the actual tag
$text =~ s/<[^~~](?:[^>'"]*|(['"]).*?\1)*>//gs;
$text =~ s/<~~/</gs;
return $text;
}
my $input = "<~~ParaStyle:Main text>In sodales malesuada nisi quis varius. Proin a ligula mauris. Proin ac justo est, vitae sollicitudin tortor. Proin auctor, <span style=\"font-size:78%\">augue eu</span> fringilla imperdiet, nisi sapien tempus libero, sed aliquet quam metus vel risus. Curabitur feugiat tristique porttitor. Integer malesuada volutpat accumsan. <span class=\"dummy\"In egestas</span> metus ut erat placerat tempus. <em>Nam vestibulum</em>, est quis scelerisque tincidunt, enim est lacinia ligula, vel accumsan ante nisl consectetur massa. Nullam velit nisi, viverra quis viverra ac, dictum ac enim. Sed nisl magna, fringilla at placerat quis, facilisis id nibh. Mauris eget sapien mauris, nec sollicitudin urna. Curabitur ac nunc a arcu vulputate tincidunt.\n<~~ParaStyle:Main text><span style=\"font-size:78%\"><em>**This is really small text</em></span>\n<ParaStyle:Comments\:Comment author>Andrew\n<~~ParaStyle:Comments\:Comment date>Friday, May 29, 2009— 8:15 PM";
print cleanText($input);
So, what's going wrong?
Also, is there a better way to maintain the InDesign tags without having the dummy tildes in the variable name?
Thanks!
Author apparently decided to parse the HTML, for more information go to the parsing follow-up question.