Answers using 'uniq' have a problem: 'uniq' only detects adjacent duplicate lines, so either non-adjacent duplicates are missed, or the data file has to be sorted first, which loses positional information.
If no line may ever be repeated, then this is relatively simple to do in Perl (or any other scripting language with regex and associative array support), provided the data source is not so huge that the set of distinct lines cannot be held in memory:
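
    #!/usr/bin/env perl
    # BEWARE: untested code!
    use strict;
    use warnings;

    my %lines;    # lines that have already been printed
    while (<>)
    {
        # Print a line only the first time it is seen.
        print if !defined $lines{$_};
        $lines{$_} = 1;
    }
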
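If memory is the worry, one possible variation is to remember a fixed-size digest of each line rather than the line itself. The sketch below is only an illustration: `dedupe_stream` is a made-up helper name, MD5 (from the core `Digest::MD5` module) is my choice rather than anything from the question, and the saving only matters when lines are long, since Perl's per-key hash overhead stays the same. A digest collision would wrongly drop a line, which is extremely unlikely for ordinary data but not impossible.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# dedupe_stream is a hypothetical helper: copy $in to $out, skipping any line
# whose 16-byte MD5 digest has been seen before. First occurrences keep
# their original order.
sub dedupe_stream {
    my ($in, $out) = @_;
    my %seen;
    while (my $line = <$in>) {
        print {$out} $line unless $seen{ md5($line) }++;
    }
    return;
}

# Small demonstration on in-memory data (made-up sample).
my $input  = "alpha\nbeta\nalpha\ngamma\nbeta\n";
my $result = '';
open my $in,  '<', \$input  or die "cannot read in-memory input: $!";
open my $out, '>', \$result or die "cannot write in-memory output: $!";
dedupe_stream($in, $out);
close $out;
print $result;    # alpha, beta, gamma, one per line
```

To filter real input, the same helper could be called as `dedupe_stream(\*STDIN, \*STDOUT)`.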
However, applied indiscriminately to XML, whole-line duplicate removal is likely to break the document, since end tags are legitimately repeated. How to avoid this? Maybe with a whitelist of 'OK to repeat' lines? Or maybe by making only lines that start with an opening tag carrying attributes subject to duplicate elimination:
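
    #!/usr/bin/env perl
    # BEWARE: untested code!
    use strict;
    use warnings;

    my %lines;    # attribute-bearing lines that have already been printed
    while (<>)
    {
        # Opening tag whose name is followed by whitespace and more text,
        # which in the usual case means a tag with attributes.
        if (m%^\s*<[^\s>]+\s[^\s>]+%)
        {
            print if !defined $lines{$_};
            $lines{$_} = 1;
        }
        else
        {
            # Everything else (end tags, bare tags, text) passes through.
            print;
        }
    }
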
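The 'OK to repeat' whitelist idea could be sketched roughly as follows. Everything here is an assumption for illustration: the helper names `may_repeat` and `dedupe_with_whitelist`, the two whitelist patterns (blank lines, and lines holding nothing but a single tag without attributes), and the sample data. Like the scripts above, it is line-based and will still go wrong if a repeated opening tag with attributes has its own end tag on a separate line.

```perl
use strict;
use warnings;

# Hypothetical whitelist of lines that may repeat: blank lines, and lines
# holding nothing but a single tag without attributes (<group>, </group>).
my @ok_to_repeat = (
    qr{^\s*$},
    qr{^\s*</?[^\s>]+>\s*$},
);

# True if the line matches any whitelist pattern.
sub may_repeat {
    my ($line) = @_;
    for my $pattern (@ok_to_repeat) {
        return 1 if $line =~ $pattern;
    }
    return 0;
}

# Copy $in to $out; whitelisted lines always pass, any other line is
# printed only on its first occurrence.
sub dedupe_with_whitelist {
    my ($in, $out) = @_;
    my %seen;
    while (my $line = <$in>) {
        print {$out} $line if may_repeat($line) || !$seen{$line}++;
    }
    return;
}

# Demonstration on in-memory data (made-up sample, not from the question).
my $input = <<'XML';
<config>
<group>
  <entry key="a" value="1"/>
</group>
<group>
  <entry key="a" value="1"/>
  <entry key="b" value="2"/>
</group>
</config>
XML

my $result = '';
open my $in,  '<', \$input  or die "cannot read in-memory input: $!";
open my $out, '>', \$result or die "cannot write in-memory output: $!";
dedupe_with_whitelist($in, $out);
close $out;
print $result;
```

In this sample the second copy of the `key="a"` entry is dropped, while both `<group>` and `</group>` pairs survive because they are whitelisted.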
Of course, there is also the (largely valid) argument that processing XML with regular expressions is misguided. All of this code assumes the XML comes with lots of convenient line breaks; real XML may contain only a very few, or none at all.