Hi, I am trying to create a script that counts the number of characters between XML tags and, ideally, groups the distinct counts by tag before returning them.

e.g.

<CONTEXT_1>aaaa<CONTEXT_1>
<CONTEXT_2>bb<CONTEXT_2>
<CONTEXT_2>dfgh<CONTEXT_2>
<CONTEXT_6>bb<CONTEXT_6>
<CONTEXT_1>bbbb<CONTEXT_1>

The result of this would be:

<CONTEXT_1> 4
<CONTEXT_2> 2,4
<CONTEXT_6> 2

Any help would be much appreciated! I'm totally stuck.

Thanks M

A: 

This is a job for Awk: a full-featured text processing language.

Something like this (not tested; $INIT_TAB_AWK is a shell variable that has to hold the awk statements that fill tab_search):

awk \
"BEGIN { $INIT_TAB_AWK } \
{ split(\$0, tab, \"\"); \
for (chara in tab) \
{ for (chara2 in tab_search) \
{ if (tab_search[chara2] == tab[chara]) { final_tab[chara2]++ } } } } \
END { for (chara in final_tab) \
{ print tab_search[chara] \" => \" final_tab[chara] } }"
vulkanino
A: 

You can do something like this using sed:

sed -n 's/^<\([^>]*\)>\(.*\)<.*$/\1 \2/p' file.xml | while read -r context text
do
    # ${#text} is the number of characters in the element's content
    echo "$context: ${#text}"
done | sort -u   # sort -u drops repeated tag/length pairs even when they are not adjacent

which prints:

CONTEXT_1: 4
CONTEXT_2: 2
CONTEXT_2: 4
CONTEXT_6: 2
dogbane
A: 

1. Use XML-specific utilities

I think that any command line tool designed to work with XML is better than custom awk/sed hacks. Scripts using such tools are more robust and do not break when the XML input is slightly reformatted (e.g. it doesn't matter where line breaks are and how the document is indented). My tool of choice for XML querying from the command line is xmlstarlet.

2. Fix your XML

Then, you need to fix your XML: close tags properly and add a root element. Something like this:

<root>
<CONTEXT_1>aaaa</CONTEXT_1>
<CONTEXT_2>bb</CONTEXT_2>
<CONTEXT_2>dfgh</CONTEXT_2>
<CONTEXT_6>bb</CONTEXT_6>
<CONTEXT_1>bbbb</CONTEXT_1>
</root>
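
If the real file has the same shape as the sample in the question (one element per line, with the closing tag written without its slash), the repair can probably be scripted. This is a minimal sketch under that assumption; the function name fix_xml is made up for illustration, and content containing characters such as & or < would still need escaping by hand.

```shell
# fix_xml: reads lines like <TAG>text<TAG> on stdin, turns the second
# tag into a proper closing tag, and wraps the result in <root>.
# Assumes one element per line; lines that do not match pass through unchanged.
fix_xml() {
    echo '<root>'
    # \1 is the tag name, \2 the text; the second <\1> must repeat the same name
    sed 's|^<\([^>]*\)>\(.*\)<\1>$|<\1>\2</\1>|'
    echo '</root>'
}
```

It could be used as fix_xml < file > test.xml, after which test.xml should be well-formed for input of that shape.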

3. Use XPath and XSLT

Select the elements you need with XPath and process them with XPath functions (xmlstarlet sel builds an XSLT template from its options). In your example, you can get the length of each element's content with:

$ xmlstarlet sel -t -m '//root/*' -v "name(.)" -o ": " -v "string-length(.)" -n test.xml 

//root/* selects all child elements of root. name(.) gives the name of the currently selected element, and string-length(.) gives the length of its text content.

And get the output:

CONTEXT_1: 4
CONTEXT_2: 2
CONTEXT_2: 4
CONTEXT_6: 2
CONTEXT_1: 4

Group results as you like with awk or similar tools.
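
For instance, the grouping asked for in the question could be done with a small awk filter. This is only a sketch: group_lengths is a hypothetical helper name, and it assumes the "NAME: LENGTH" line format printed by the xmlstarlet command above.

```shell
# group_lengths: reads "NAME: LENGTH" lines on stdin and prints each name
# once (in first-seen order) with its distinct lengths joined by commas.
# The function name is hypothetical, not part of xmlstarlet or awk.
group_lengths() {
    awk -F': ' '
        !seen[$1, $2]++ {
            # first time this (name, length) pair shows up
            if ($1 in lens) {
                lens[$1] = lens[$1] "," $2
            } else {
                order[++n] = $1
                lens[$1] = $2
            }
        }
        END {
            for (i = 1; i <= n; i++) print order[i], lens[order[i]]
        }
    '
}
```

Piping the five output lines shown above into group_lengths gives CONTEXT_1 4, CONTEXT_2 2,4 and CONTEXT_6 2, one per line.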

jetxee
A: 
$ awk -F">" '{sub("<.*","",$2);a[$1]=a[$1]","length($2)}END{for (i in a) print i,a[i]}' file
<CONTEXT_6 ,2
<CONTEXT_1 ,4,4
<CONTEXT_2 ,2,4
ghostdog74
A: 

Using Perl:

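#!/usr/bin/perl
use strict;
use warnings;

my %table;    # tag name => list of content lengths
open my $fh, '<', $ARGV[0] or die $!;
while (my $line = <$fh>) {
    # capture the tag name and the text up to the last '<' on the line
    if ($line =~ /^<([^>]*)>(.*)<.*$/) {
        push @{ $table{$1} }, length($2);
    }
}
close $fh;
foreach my $key (sort keys %table) {
    print "$key ", join(",", @{ $table{$key} }), "\n";
}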

Output is:

CONTEXT_1 4,4
CONTEXT_2 2,4
CONTEXT_6 2
dogbane