ansaurus

Question

Unix script to count the number of characters between particular XML tags

Answer 1

A:

This is a job for Awk: a full-featured text processing language.

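Something like this (not tested; INIT_TAB_AWK is a shell variable that must hold the awk statements that fill tab_search with the characters to count):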

awk \
"BEGIN { $INIT_TAB_AWK } \
{ split(\$0, tab, \"\"); \
for (chara in tab) \
{ for (chara2 in tab_search) \
{ if (tab_search[chara2] == tab[chara]) { final_tab[chara2]++ } } } } \
END { for (chara in final_tab) \
{ print tab_search[chara] \" => \" final_tab[chara] } }"
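As a runnable sketch of the same idea, the function below takes the characters to look for as its argument instead of relying on a pre-built shell variable. The names count_chars and wanted are hypothetical, and substr is used in place of split(..., "") because splitting on an empty string is an extension that not every awk supports. Like the snippet above, it tallies individual characters over the whole input; it does not look at the tags.

```shell
# count_chars: tally how often each character of $1 occurs on stdin.
# Prints "CHAR => COUNT" for every searched character seen at least once.
count_chars() {
    awk -v wanted="$1" '
        BEGIN {
            # Stand-in for INIT_TAB_AWK: one tab_search entry per character.
            for (i = 1; i <= length(wanted); i++)
                tab_search[i] = substr(wanted, i, 1)
        }
        {
            # Walk the line one character at a time.
            for (i = 1; i <= length($0); i++)
                for (j in tab_search)
                    if (tab_search[j] == substr($0, i, 1))
                        final_tab[j]++
        }
        END {
            for (j in final_tab)
                print tab_search[j] " => " final_tab[j]
        }
    ' | sort   # awk array order is unspecified, so sort for stable output
}
```

It could be called as count_chars ab < file.xml to count every "a" and "b" in the file.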

vulkanino 2010-09-21 08:05:05

Answer 2

A:

You can do something like this using sed:

sed 's/^<\([^>]*\)>\(.*\)<.*$/\1 \2/' file.xml | sort | while read -r context text
do
    # context is the tag name; ${#text} is the length of the element's content
    echo "$context: ${#text}"
done | uniq

which prints:

CONTEXT_1: 4
CONTEXT_2: 2
CONTEXT_2: 4
CONTEXT_6: 2

dogbane 2010-09-21 08:49:13

Answer 3

A:

1. Use XML-specific utilities

I think that any command line tool designed to work with XML is better than custom awk/sed hacks. Scripts using such tools are more robust and do not break when the XML input is slightly reformatted (e.g. it doesn't matter where line breaks are and how the document is indented). My tool of choice for XML querying from the command line is xmlstarlet.

2. Fix your XML

Then, you need to fix your XML: close tags properly and add a root element. Something like this:

<root>
<CONTEXT_1>aaaa</CONTEXT_1>
<CONTEXT_2>bb</CONTEXT_2>
<CONTEXT_2>dfgh</CONTEXT_2>
<CONTEXT_6>bb</CONTEXT_6>
<CONTEXT_1>bbbb</CONTEXT_1>
</root>
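If the elements are already closed properly and only the root element is missing, adding it can be scripted. This is a minimal sketch under that assumption; the function name wrap_in_root is hypothetical, and it does nothing to repair unclosed tags, which still have to be fixed by hand or at the source.

```shell
# wrap_in_root: copy stdin to stdout inside a single <root> element.
# Assumes the input has no XML declaration and no root element of its own.
wrap_in_root() {
    echo '<root>'
    cat
    echo '</root>'
}
```

For example, wrap_in_root < fragment.xml > test.xml, where fragment.xml stands for the original file without a root.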

3. Use XPath and XSLT

Select the elements you need with XPath and process them with XSLT expressions. In your example, you can print each element's name and the length of its content with:

$ xmlstarlet sel -t -m '//root/*' -v "name(.)" -o ": " -v "string-length(.)" -n test.xml

//root/* selects all child elements of the root element. name(.) returns the name of the currently selected element, and string-length(.) returns the length of its text content. The -o option prints the literal separator and -n ends each line.

The output is:

CONTEXT_1: 4
CONTEXT_2: 2
CONTEXT_2: 4
CONTEXT_6: 2
CONTEXT_1: 4

Group results as you like with awk or similar tools.

jetxee 2010-09-21 09:07:50

Answer 4

A:

$ awk -F">" '{sub("<.*","",$2);a[$1]=a[$1]","length($2)}END{for (i in a) print i,a[i]}' file
<CONTEXT_6 ,2
<CONTEXT_1 ,4,4
<CONTEXT_2 ,2,4

ghostdog74 2010-09-21 11:44:49

Answer 5

A:

Using Perl:

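#!/usr/bin/env perl
use strict;
use warnings;

# Map each tag name to the list of content lengths seen for it.
my %table;
open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
while (my $line = <$fh>) {
    # Capture the tag name and the text up to the last '<' on the line.
    if ($line =~ /^<([^>]*)>(.*)<.*$/) {
        push @{ $table{$1} }, length($2);
    }
}
close $fh;
foreach my $key (sort keys %table) {
    print "$key ", join(',', @{ $table{$key} }), "\n";
}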

Output is:

CONTEXT_1 4,4
CONTEXT_2 2,4
CONTEXT_6 2

dogbane 2010-09-21 12:16:18
