ansaurus

Question

Get a list of delimited filenames from a text file

Answer 1

+1 A:

How about this:

cat file.xml | sed -e's/^[^>]*>//' -e's/<.*$//' | grep \\.

It's not very general-purpose, but a fully general solution would be a lot more complicated (XML really needs a full parser, etc.).

Basically, the sed script has two parts. The first strips everything from the beginning of the line (^) up to and including the first ">" character; it does that by matching a run of non-">" characters followed by ">". The second strips everything from the leftmost "<" character to the end of the line. Because the second part runs after the first, the opening tag is already gone by then, which is why it doesn't erase the whole line.

Then the grep command keeps only lines with a "." in them, which should be just the lines where a filename remains.
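
To make the three stages concrete, here is a small sketch that runs them one at a time. The input line is hypothetical (the real file.xml is not shown in this thread); only the filename is taken from the output quoted elsewhere on this page.

```shell
# Hypothetical input line; the real file.xml is not shown in this thread.
line='<tspan font-size="10">/Volumes/Secondary500/Temp/Untitled-2_Layer 1.pdf</tspan>'

# Stage 1: delete everything up to and including the first ">".
stage1=$(printf '%s\n' "$line" | sed -e 's/^[^>]*>//')
printf '%s\n' "$stage1"

# Stage 2: delete from the leftmost remaining "<" to the end of the line.
stage2=$(printf '%s\n' "$stage1" | sed -e 's/<.*$//')
printf '%s\n' "$stage2"

# Stage 3: keep the line only if it contains a literal ".".
result=$(printf '%s\n' "$stage2" | grep '\.')
printf '%s\n' "$result"
```

After stage 1 the closing tag is still attached to the filename; stage 2 removes it, and stage 3 lets the line through because of the "." in ".pdf".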

Hope that helps!

Chris Harris 2009-06-30 02:33:13

sed -e's/^[^>]*>//' -e's/<.*$//' file.xml | grep \\.

extra cat! extra cat! /me points.

hometoast 2009-06-30 02:35:41

Answer 2

A:

nik 2009-06-30 02:58:32

Answer 3

A:

Sed and awk are generally not the right way to read XML. They may work, but the XML can change layout at any time and break things, while still being perfectly valid XML.
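
As a rough illustration of that fragility, the sketch below feeds the sed/grep pipeline from Answer 1 a hypothetical one-line layout in which two tspan elements share a line. The element contents (a.pdf, b.pdf) are made up for the example; the markup is still well-formed XML.

```shell
# Hypothetical one-line layout; still well-formed XML.
compact='<text><tspan>a.pdf</tspan><tspan>b.pdf</tspan></text>'

# Same sed/grep pipeline as in Answer 1, fed from a variable instead of a file.
# "|| true" keeps the script going when grep finds no match (exit status 1).
found=$(printf '%s\n' "$compact" | sed -e 's/^[^>]*>//' -e 's/<.*$//' | grep '\.' || true)

# Count the lines that survived; grep -c prints 0 when nothing matches.
count=$(printf '%s' "$found" | grep -c . || true)
printf 'filenames found: %s\n' "$count"
```

The first substitution removes only the leading <text> tag, so the second one then deletes everything from the next "<" onward, and both filenames are lost.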

Much better is to use something like Perl. Install the XML::Smart module either via CPAN, or on Ubuntu with "sudo apt-get install libxml-smart-perl".

Then a simple script like this:

use strict;
use diagnostics;

use XML::Smart;

my $xml = XML::Smart->new ("svg.xml") || die "Cannot read XML: $!.";
my $version = $xml->{svg}{version} || die "Cannot determine SVG version.";

foreach my $file ($xml->{svg}{text}{tspan}('@')) {
    print $file->content . "\n";
}

Save it as svg.pl. Save your XML as svg.xml.

$ perl svg.pl
/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy 2.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1.pdf

This:

Parses the XML, checking it is correct.
Checks that the version exists (just a sanity check really).
Loops through an array of all svg/text/tspans and prints the content.

Have fun!

the.jxc 2009-06-30 03:13:22

Answer 4

A:

anschauung 2009-06-30 03:38:41

Answer 5

A:

If you have xmlgawk, you can get them easily.

@load xml

BEGIN {
    XMLMODE = 1;
    XMLCHARSET = "utf-8";
}

XMLCHARDATA {
    data = $0;
}

XMLENDELEM == "tspan" {
    print data;
}

and

$ xgawk -f pick_from_svg.awk sample.xml 
/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy 2.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1.pdf

2009-07-01 12:29:17

Answer 6

A:

awk 'BEGIN{RS="font-size=\"10\">|</tspan>"}/pdf/' xml.txt

Result

$ awk 'BEGIN{RS="font-size=\"10\">|</tspan>"}/pdf/' xml.txt
/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy 2.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1.pdf

This is probably the simplest version so far: no messy regex, and it is easy to extend and adjust to your liking. I chose to match on the term 'pdf', hence the /pdf/ portion of the code. If, for example, you want to match files that aren't PDFs but whose paths contain the word 'Volumes', simply use /Volumes/ instead.
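
A minimal sketch of the /Volumes/ variant, run against a hypothetical two-line sample (the real xml.txt is not shown in this thread, and the caption line is invented for contrast). It assumes an awk that treats a multi-character RS as a regular expression, which gawk and mawk do; other awks may use only the first character.

```shell
# Hypothetical two-line sample; the real xml.txt is not shown in this thread.
sample='<tspan font-size="10">/Volumes/Secondary500/Temp/Untitled-2_Layer 1.pdf</tspan>
<tspan font-size="10">Some caption text</tspan>'

# Split records on either delimiter, then print only records containing "Volumes".
# A multi-character, regex-valued RS is an extension (gawk and mawk support it).
matches=$(printf '%s\n' "$sample" | awk 'BEGIN{RS="font-size=\"10\">|</tspan>"}/Volumes/')
printf '%s\n' "$matches"
```

Only the record holding the path is printed; the caption record is skipped because it does not contain "Volumes".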

SiegeX 2009-12-12 05:23:20
