Hi, I'm really new to Bash, so this could sound silly to most of you. I'm trying to get a list of some filenames from a text file. Tried to do this with sed and awk, but couldn't get it to work with my limited knowledge.

This is a sample file content:

<?xml version="1.0" encoding="utf-8"?>
<!-- Generator: Adobe Illustrator 13.0.1, SVG Export Plug-In . SVG Version: 6.00 Build 14948)  -->
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg version="1.1" id="Layer_1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" x="0px" y="0px"
 width="471.677px" height="126.604px" viewBox="0 0 471.677 126.604" enable-background="new 0 0 471.677 126.604"
 xml:space="preserve">
<rect x="0.01" y="1.27" fill="none" width="471.667" height="125.333"/>
<text transform="matrix(1 0 0 1 0.0098 8.3701)"><tspan x="0" y="0" font-family="'MyriadPro-Regular'" font-size="10">/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy 2.pdf</tspan><tspan x="0" y="12" font-family="'MyriadPro-Regular'" font-size="10">/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy.pdf</tspan><tspan x="0" y="24" font-family="'MyriadPro-Regular'" font-size="10">/Volumes/Secondary500/Temp/Untitled-2_Layer 1.pdf</tspan></text>
</svg>

What I would like to get from this sample is a new text file with this exact content:

/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy 2.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1.pdf

I thought of telling sed to print all the matching entries between 'font-size="10">' and '</tspan>', but the best I got was a file with the whole line containing my field delimiters.

If you could explain each step, that would be great.

  • There could be more or fewer filenames. These 3 are just an example.
A: 

How about this:

cat file.xml | sed -e's/^[^>]*>//' -e's/<.*$//' | grep \\.

It's not very general-purpose, but to be fully general would be A LOT more complicated (XML requires a full parser, etc.).

Basically, the sed script has two parts. The first strips all characters from the beginning of the line (^) up to and including the first ">" character; note that it matches a run of non-">" characters ([^>]*) in order to do that. The second strips everything from the leftmost "<" character to the end of the line. Since the second part runs AFTER the first, the opening tag is already gone by then, which is why it doesn't erase the whole line.

Then the grep command keeps only lines containing a ".", which leaves just the lines with filenames.
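To make the two sed steps concrete, here is a minimal sketch that applies them one at a time. The input line and the path /tmp/a file.pdf are made up for illustration, and the sketch assumes one tspan element per line, which is what this approach expects.

```shell
# Hypothetical input: a single tspan element on its own line.
line='<tspan x="0" y="0">/tmp/a file.pdf</tspan>'

# Step 1: drop everything up to and including the first ">".
step1=$(printf '%s\n' "$line" | sed -e 's/^[^>]*>//')

# Step 2: drop everything from the leftmost "<" to the end of the line.
step2=$(printf '%s\n' "$step1" | sed -e 's/<.*$//')

printf '%s\n' "$step1"   # /tmp/a file.pdf</tspan>
printf '%s\n' "$step2"   # /tmp/a file.pdf
```

Lines that hold only tags end up empty after both steps, which is why the final grep for a "." is enough to filter them out.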

Hope that helps!

Chris Harris
sed -e's/^[^>]*>//' -e's/<.*$//' file.xml | grep \\.
extra cat! extra cat! /me points.
hometoast
A: 

Sed and awk are generally not the right way to read XML. They may work, but the XML can change layout at any time and break things, while still being perfectly valid XML.
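As a rough illustration of that fragility, the sketch below feeds the same two elements through a sed/grep pipeline in two layouts. The helper name extract and the paths /tmp/a.pdf and /tmp/b.pdf are invented for this example; both inputs are equally valid XML fragments.

```shell
# Hypothetical helper: the line-oriented sed/grep pipeline from the answer above.
extract() { sed -e 's/^[^>]*>//' -e 's/<.*$//' | grep '\.'; }

# Layout 1: one element per line.
one_per_line='<tspan>/tmp/a.pdf</tspan>
<tspan>/tmp/b.pdf</tspan>'

# Layout 2: the same elements on a single line.
same_line='<tspan>/tmp/a.pdf</tspan><tspan>/tmp/b.pdf</tspan>'

found_split=$(printf '%s\n' "$one_per_line" | extract)
found_joined=$(printf '%s\n' "$same_line" | extract)

printf '%s\n' "$found_split"    # prints both paths, one per line
printf '%s\n' "$found_joined"   # prints only /tmp/a.pdf
```

With the single-line layout the second path is dropped silently, with no error to warn you.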

Much better is to use something like Perl. Install the XML::Smart module either via CPAN or, on Ubuntu, with "sudo apt-get install libxml-smart-perl".

Then a simple script like this:

use strict;
use diagnostics;

use XML::Smart;

my $xml = XML::Smart->new ("svg.xml") || die "Cannot read XML: $!.";
my $version = $xml->{svg}{version} || die "Cannot determine SVG version.";

foreach my $file ($xml->{svg}{text}{tspan}('@')) {
    print $file->content . "\n";
}

Save it as svg.pl. Save your XML as svg.xml.

$ perl svg.pl
/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy 2.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1.pdf

This:

  • Parses the XML, checking it is correct.
  • Checks that the version exists (just a sanity check really).
  • Loops through an array of all svg/text/tspans and prints the content.

Have fun!

the.jxc
A: 

If you have xmlgawk, you can get them easily. Save this as pick_from_svg.awk:

@load xml

BEGIN {
    XMLMODE = 1;
    XMLCHARSET = "utf-8";
}

XMLCHARDATA {
    data = $0;
}

XMLENDELEM == "tspan" {
    print data;
}

and

$ xgawk -f pick_from_svg.awk sample.xml 
/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy 2.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1.pdf
A: 
awk 'BEGIN{RS="font-size=\"10\">|</tspan>"}/pdf/' xml.txt

Result

$ awk 'BEGIN{RS="font-size=\"10\">|</tspan>"}/pdf/' xml.txt
/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy 2.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1.pdf

This is probably the simplest solution so far: there is no messy regex, and it is easy to extend and adjust to your liking. It splits the input into records at every 'font-size="10">' and every '</tspan>', then prints only the records that match a pattern. I chose to match the term 'pdf', hence the /pdf/ portion of the code. If, for example, you had other files that aren't PDFs but do contain the word 'Volumes', you could simply use /Volumes/ instead.
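Here is a small sketch of the /Volumes/ variant on a made-up input. The sample string and its two paths are invented for illustration. It also assumes an awk that accepts a regular expression as RS (gawk and mawk do); POSIX leaves a multi-character RS unspecified, so other awks may behave differently.

```shell
# Hypothetical input: two tspan elements on one line, one PDF and one non-PDF.
sample='<text><tspan x="0" y="0" font-size="10">/Volumes/A/one file.pdf</tspan><tspan x="0" y="12" font-size="10">/Volumes/A/two.txt</tspan></text>'

# Split records at each delimiter, then keep only records containing "Volumes".
result=$(printf '%s\n' "$sample" | awk 'BEGIN{RS="font-size=\"10\">|</tspan>"} /Volumes/')

printf '%s\n' "$result"
# /Volumes/A/one file.pdf
# /Volumes/A/two.txt
```

Because the delimiters act as record separators, this works even when every tspan sits on the same line.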

SiegeX